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Abstract 


Representation  of  natural  signals  such  as  sounds  and  images  is  critically  impor¬ 
tant  in  a  broad  range  of  fields  such  as  multimedia,  data  communication  and  storage, 
biomedical  imaging,  robotics,  and  computational  neuroscience.  Often  it  is  crucial 
that  the  representation  be  efficient,  i.e.,  the  signals  of  interest  are  encoded  economi¬ 
cally.  It  is  also  desirable  that  the  representation  be  robust  to  various  types  of  noise.  In 
this  thesis,  we  advocate  several  ways  to  expand  current  signal  encoding  approaches 
via  the  framework  of  adaptive  representations. 

In  recent  decades,  the  multiresolution  paradigm  has  provided  powerful  mathe¬ 
matical  and  algorithmic  tools  to  signal  encoding.  In  spite  of  widely  proven  effec¬ 
tiveness,  such  methods  ignore  statistical  structure  of  the  class  of  signals  they  should 
represent.  On  the  other  hand,  high  computational  costs  artificially  confine  standard 
linear  adaptive  statistical  models  to  relatively  small  block-based  encoding  scenarios. 
We  show  that  a  good  tradeoff  between  computational  complexity  and  coding  effi¬ 
ciency  can  be  achieved  via  a  hybrid  encoding  scheme:  Multiresolution  ICA.  When 
applied  to  natural  images  the  new  method  significantly  outperforms  JPEG2000,  the 
current  compression  standard,  which  indicates  adaptivity  as  a  source  of  practical 
improvement  for  modern  coders. 

Sparsely  encoding  large  signals  via  a  set  of  adaptive  variable-size  shiftable  ker¬ 
nels  has  been  studied  in  several  contexts,  like  efficient  auditory  coding.  One  impor¬ 
tant  merit  of  this  paradigm  is  that,  besides  efficient  adaptive  coding,  it  also  provides 
a  direct  approach  towards  an  (approximately)  shift-invariant  representation.  This  is 
especially  desirable  in  modeling  encoding  systems  robust  to  signal  shifts,  such  as 
biological  sensory  systems.  We  study  this  problem  in  the  case  of  images  and  pro¬ 
vide  contributions  leading  to  fast  and  superfast  algorithms,  significantly  improving 
the  complexity  of  the  kernel  learning  process. 

The  third  part  of  this  thesis  is  a  mathematical  study  of  Robust  Coding  -  the  prob¬ 
lem  of  optimal  linear  coding  with  limited  precision  units.  We  characterize  optimal 
solutions  in  the  case  of  Gaussian  channel  noise  and  arbitrarily  many  encoding  units, 
and  derive  efficient  and  stable  algorithms  for  their  computation.  By  expressing  the 
limit  of  optimization  as  a  closed-form  bound,  we  provide  a  formal  justification  of 
the  intuition  that  noisy  encoding  units  can  preserve  signal  information  if  sufficiently 
many  are  used  -  a  case  very  relevant  to  modeling  neural  encoding  systems. 


VI 


Acknowledgments 


Going  down  the  Ph.D.  path  is  an  extraordinary  adventure.  Truly  valuable  research  gems 
are  never  easy  to  find,  and  many  monsters  (frustration,  impatience,  procrastination,  despair) 
await  for  the  weak  to  slip  into  their  traps.  I  am  deeply  grateful  to  my  advisor  Mike  Lewicki 
for  inspiring  me  to  discover  nothing  less  than  the  most  precious  stones,  for  subtly  helping 
me  become  more  and  more  independent,  as  well  as  for  trying  to  make  sure  that  I  don’t  get 
scared  along  the  way.  Not  a  typical  C.S.  advisoij1],  he  did  more  than  his  fair  share  as  fearless 
lab  leader.  He  exposed  me  to  fascinating  problems,  suggested  ideas  I  wouldn’t  have  come 
across  in  a  million  years,  and  was  there  when  I  needed  him.  (Also,  every  now  and  then,  he’d 
help  me  up  without  my  even  beginning  to  realize  I  need  to  start  asking.) 

Along  the  way,  I  discovered  several  other  people  which  were  so  kind  to  lend  me  their 
valuable  insight,  their  spectacular  vision,  and  huge  chunks  of  their  precious  time.  Jelena 
“with  the  magic  touch”  Kovacevic  helped  me  discover  entire  research  topics,  pointed  me  to 
breathtaking  papers,  invited  me  to  her  own  lab  meetings,  and  injected  me  with  many  shots 
of  confidence.  Manuel  Blum  overwhelmed  me  with  his  wisdom,  humor,  kindness,  respect 
for  students,  and  appetite  for  elegance  in  thought,  word,  and  deed.  Among  multi-faceted 
computer  scientists,  one  of  the  best  is  Gary  Miller;  he  offered  competent  advice  to  me  on 
several  dozen  different  topics,  always  being  a  step  ahead  and  always  suggesting  yet  another 
really  cool  idea  to  try.  Probably  the  best  faculty  role  model  for  me  has  been  Markus  Piischel; 
I  have  never  enjoyed  any  other  course  in  my  entire  “career”  more  than  I  did  his  Algebraic 
Signal  Processing  Theory,  and  few  times  did  I  have  so  much  fun  (and  apparent  ease!)  when 
working  on  a  research  problem  as  I  had  during  our  collaboration.  Danke  schon! 

When  considering  getting  a  Ph.D.,  it  is  extremely  useful  to  have  someone  lay  out  a  solid 
plan  for  what  you  should  expect  ahead  (and  how  to  handle  it);  in  my  case,  that  someone  is 
George  Necula.  Without  his  generous  and  wise  advice,  there’s  no  telling  where  I’d  gotten. 
A  great  contribution  to  my  “A  Ph.D.  is  not  enough!”  wake-up  call  was  that  of  Justinian 
Rosea,  my  internship  supervisor,  who  recommended  the  book  with  such  self-explanatory 
title  and  offered  his  valuable  assistance  every  time,  without  hesitation.  I  would  probably  be 
doing  something  completely  different  today  without  the  inspiration  and  tremendous  support 
of  Prof.  Luminita  State,  my  advisor  at  the  University  of  Bucharest;  her  enthusiasm  and 
charisma  left  me  helplessly  in  love  with  Artificial  Intelligence  and  so  much  more. 

But  to  get  to  know  all  of  the  people  above,  something  else  had  to  have  happened  first: 
meeting  the  greatest  couple  of  Math  teachers  on  the  face  of  the  Earth,  Dana  and  Eugen 
Radu  (aka  na§ii)\  they  knew  exactly  how  to  push  my  button s0  so  that  I’ll  always  love  and 
appreciate  the  geometrical  beauty  of  thinking  clearly.  Each  of  them  explained  to  me  more 
than  just  about  my  ABC’s,  or  even  ABCD’s;  by  their  humor,  warm  decency,  mind-blowing 

Right...  they  never  are! 

An  operation  that  I  now  call  “to  program”. 


short  proofs,  and  (amazingly!)  being  right  all  the  time,  I  learned  that  there’s  nothing  I’d  like 
better  than  to  follow  in  their  footsteps.  I  would  like  to  thank  my  high-school  Rom.-lit.  prof 
Mrs.  Emanuela  Cri§an  for  the  delightful  and  sparky  in-class  arguments,  always  on  the  same 
theme  (in  many  disguises):  what’s  more  important  -  human  reason,  or  divine  revelation?  I 
won  every  time:  the  prize  I  took  away  were  her  always  beautiful  and  uplifting  thoughts. 

All  members  of  CPLabj]  at  CMU  (aka  MikeLab)  contributed  in  unique  ways  to  main¬ 
taining  a  friendly  research  environment.  Our  lab  meetings  lead  us  to  most  interesting  dis¬ 
cussions,  as  well  as  to  new  and  creative  ways  to  explain  our  ideas.  I  owe  a  lot  to  the  super- 
amazing  Yan  Karklin,  the  elegant  Evan  Smith,  the  sunny  Sofia  Cavaco,  the  playful  Jing 
Chen,  the  conscientious  Daniel  Leeds,  and  the  awesome  Woo- Young  Lee.  Determinant  for 
me  was  the  company  of  Eizaburo  Doi;  I  am  grateful  to  him  for  introducing  me  to  Robust 
Coding  and  for  having  the  patience  of  smoothing  the  heck  out  of  my  roughest  rough  edges 
(research- wise).  Arigato!  Another  extraordinary  office  mate  was  Dr.[]  Andrew  Gilpin,  who 
semi-supervisedly  learned  Romanian  to  become  an  even  better  office  mate  (also,  to  under¬ 
stand  what  was  going  on  in  the  office  behind  his  back).  I’ll  never  forget  his  first  Ce  faci ?j] 
uttered  at  me,  or  Ma  dau  jos  la  tine!^  playfully  addressed  to  our  other  (more  restless)  Ro¬ 
manian  office  mate.  He  even  got  to  use  this  knowledge  outside  of  the  office,  in  his  visit  to 
Romania,  amazing  and  flattering  everybody  (particularly  my  folks).  Mulfumesc! 

I  know  now  that  life  in  academia  is  more  than  “publish  or  perish”.  Actually,  it’s  more 
like  “appreciate  all  your  collaborators  and  co-authors  for  all  they  do  to  make  it  easier  and 
fun  for  you  to  sit  around  and  look  smart...  or  perish”.  At  CMU,  I  was  blessed  to  have 
met  an  amazing  group  of  fellow  students  who  influenced  much  of  what  I  know,  and  how  I 
(should)  think.  I  will  only  pick  two  out  of  this  remarkable  crowd.  It’s  hard  to  describe  how 
much  fun  it  is  to  think  about  the  most  impossible  of  problems  when  Gowri  Srinivasa  is  in 
the  house.  Compared  to  that,  even  her  generous  and  cheerful  help  giving,  or  contagiously 
enthusiastic  energy  seem  to  fall  short  (although  not  by  much).  Her  new  students  might  not 
know  yet  under  what  a  lucky  star  they  have  been  gathering  (but  for  sure  they’ll  find  out!).  An 
extraordinary  guy  is  my  fellow  Eastern  European  co-author  Aliaksei  Sandryhaila.  Dynamic 
and  determined,  friendly  and  brilliant,  practical  and  funny,  he  often  seems  like  nothing  can 
stop  him.  So  far  nothing  has,  and  I’m  willing  to  bet  that  nothing  will.  A  great  deal  of  my 
gratitude  and  appreciation  goes  to  Justin  Romberg  and  Nick  O’Donoughue  for  promptly  and 
graciously  agreeing  to  help  out  when  I  was  recently  unable  to  physically  present  two  of  my 
posters  (any  grad  student’s  nightmare). 

The  work  in  this  thesis  would  have  not  been  possible  without  the  care  and  dedication  of 
competent  and  nurturing  staff.  Thank  you,  Sharon  Burks  and  Deb  Cavlovich,  for  the  many 
last-minute  support  letters  and  for  everything  else  in  between!  As  for  Ms.  Anna  Hegedus, 
the  “know-it-all  supreme”  title  in  computer...  everything,  is  not  so  far  from  the  truth. 

In  Pittsburgh,  you  are  fortunate  if  you  have  around  many  Romanians  like  the  ones  I  had. 
From  them  I  learned  that  a  hearty  community  can  be  stronger  than  a  country.  Here  is  to 
Alina  Oprea,  Radu  Niculescu,  and  Cristina  Canepa.  Special  thanks  go  to  Florin  Oprea  for 
always  being  Mr.  Helpful,  Mr.  Available,  Mr.  Reliable,  and  generally  the  best  soccer  player 
I’ve  ever  aspired  to  be.  Speaking  of  which,  I’d  like  to  thank  all  my  soccer  team  mates  whose 

Laboratory  for  Computational  Perception  and  Statistical  Learning 

At  the  moment  of  this  writing,  the  newest  Doctor  around! 

“How  are  you  doing?”  (Rom). 

“I’m  coming  down  to  get  you!”;  as  in,  “If  you’re  not  good...”  (Rom). 


efforts  over  the  years  inspired  me  to  get  serious  about  winning  the  2008  CMU  Intermediate 
Intramural  Indoor  Tournament:  Leonid,  Aaron,  Vince,  David,  Daniel,  Lucian,  Rob,  Will, 
and  the  other  thoroughbred  “Real  Mellons”. 

In  the  past  few  years,  my  old  envy  of  friends  with  siblings  got  cured;  the  long  wait 
finally  paid  off  big  time  due  to  my  ever-so-studious  brother-in-law  Marius  and  to  my  smart, 
beautiful,  and  witty  sister-in-law  Aliona.  (Now  others  can  envy  me!)  They  -  in  fact  all  my 
family  and  friends  back  home  -  deserve  my  sincere  apologies  for  my  not  writing  or  calling 
as  often  as  I  had  liked,  as  well  as  my  gratitude  for  their  going  ahead  and  asking  what’s  up, 
how  much  longer  is  the  PhD  supposed  to  take,  or  when  is  that  (ever  elusive!)  next  visit  home 
likely  to  happen  -  so  that  they  can  properly  plan  and  be  completely  available  when  it  comes 
to  satisfying  my  every  whim. 

Last,  but  not  least,  I  would  like  to  thank  my  parents,  Marieta  and  Nelu,  for  all  the  uncon¬ 
ditional  love  and  support  they  offered  me  throughout  my  life.  My  mother  encouraged  me  to 
explore  more  of  this  world  so  that  I  could  know  more  about  who  I  am  and  who  I  should  want 
to  become.  She  is  also  the  first  person  in  my  family  with  a  Ph.D.  From  my  father  I  learned 
how  important  it  is  to  be  strong,  stubborn,  patient,  pragmatic,  but  relaxed  and  (at  least  from 
time  to  time)  extremely  funny.  Our  fishing  trips  are  excellent  lessons  about  how  (mostly  his) 
preparation  can  meet  (sometimes  my)  opportunity;  if  only  I  had  realized  earlier  that  this  is 
what  “fisherman’s  luck”  is  all  about. 


Dr.  Nina  Balcan  is  beyond  any  acknowledgment.  If  it  were  only  for  her  moral  contri¬ 
bution  to  this  thesis,  her  name  should  be  engraved  in  golden  capital  letters  on  every  pageQ. 
Ever  since  I  met  her,  she  is  constantly  showing  me  by  the  power  of  personal  example  how  to 
be  better;  how  to  constantly  want  more;  how  to  get  more;  how  willpower  beats  all  adverse 
odds  and  prejudice  eventually;  how  respecting  and  loving  others  does  not  exclude  always 
demanding  nothing  less  than  the  best  from  them;  how  to  not  forget  anything  good  (or  bad!). 
She’s  my  twenty-four-hours-a-day  advisor,  best  friend,  hero,  favorite  researcher,  and  loving 
wife.  And  I  feel  that’s  only  the  beginning... 


7I  am  not  sure,  but  I  suspect  this  goes  against  the  department’s  policy,  or  dissertation  guidelines,  or  both. 


IX 


X 


Contents 


1  Introduction 


1.1 

Motivation  .  . 

1.2 

Thesis  outline 

2  Background 

|2. 1  Bases,  Frames,  and  Dictionaries . 

[2.2  Adaptive  Models:  ICA  and  Robust  Coding 

[2.3  Sparse  Approximations . 

[2.4  Shiftable  Kernel  Representations. . 


3  Multiresolution  ICA 

[3.1  Introduction . 

[3.2  Adaptive  Multiresolution  Coding 
|3.3  Complexity  Issues  of  MrIC A  .  . 

[3.4  Experimental  Results . 

|3.5  Concluding  Remarks . 


X  Point  Coding 

|4. 1  Introduction . 

[4.2  The  Point  Coding  Problem 

[4.3  Matching  Pursuit|  .  .  .  .  . 

|4.4  Dictionary  Update . 

|4.5  Experimental  Results  .  .  . 
|4.6  Conclusion  . 


5  Robust  Coding 

[5 . 1  Introduction . 

|5.2  Robust  Coding  Solutions  . 
|5.3  Robust  Coding  Algorithms 

|5.4  Discussion . 

|5.5  Appendix . 


S  Conclusions 


[Bibliography 


1 

1 

2 

5 

5 

7 

10 

12 

15 

15 

17 

19 

20 

22 

27 

27 

28 

29 

31 

32 

35 

37 

37 

39 

49 

50 

51 

59 

63 


xi 


xii 


List  of  Figures 


3.2  Basis  functions  computed  by  MrICA  with  1  MR  decomposition  level  for  32x32  log-scale 
natural  images  (see  text).  For  each  subband,  a  random  set  of  basis  functions  are  displayed.  20 

3.3  Layout  of  the  parameters  corresponding  to  MrICA  basis  functions,  in  the  Spatial  fre- 

ion  (angular)  domain.  The  black  circles  represent  the  pa- 
lctions  computed  for  the  approximation  subband.  Colored 
ins  from  the  intermediate  detailed  subbands  (red=horizontal, 
il).  Colored  dots  represent  basis  functions  from  the  highest 


3.4  Relative  rate-distortion  performance  of  three  methods  (MrICA,  wavelet,  JASPER)  com¬ 
puted  for  the  32  x  32  test  images . 24 

3.5  Examples  of  32  x  32  images  reconstructed  at  25dB.  Column  1. Original  images;  2.1mage 

encoded  by  non-adaptive  method;  3.Error;  4.1mage  encoded  by  MrICA  ;  5.Error . 25 

3.6  Examples  of  64  x  64  images  reconstructed  at  20dB.  Column  1. Original  images;  2.1mage 

encoded  by  non-adaptive  method;  3.Error;  4.1mage  encoded  by  MrICA  ;  5.Error . 26 


quency  (radial)  vs.  Orientat 
rameters  of  MrICA  basis  fu 
circles  represent  basis  functic 
green=vertical,  bluc=diagon; 


resolution  detailed  subbands. 


|4. 1  A  graphical  depiction  of  the  data  organization  for  the  efficient  Matching  Pursuit  algorithm. 


4.2  Results  of  applying  the  Point  Coding  method  to  natural  images  in  the  Kyoto  database. 
The  dictionary  was  initialized  with  K  =  25  random  kernels  of  size  10  x  10.  Kernels  are 
up-scaled  for  a  better  visualization;  actual  pixel  size  is  displayed  above  each  kernel  subplot. 

4.3  Results  of  applying  the  Point  Coding  method  to  newspaper  images.  The  dictionary  was 
initialized  with  K  =  40  random  kernels  (only  9  are  shown)  of  size  8x8.  Kernels  are 


4.4 


up-scaled  for  a  better  visualization;  actual  pixel  size  is  displayed  above  each  kernel  subplot. 
Results  of  applying  the  Point  Coding  method  to  fingerprint  images.  The  dictionary  was 


initialized  with  K  =  25  random  kernels  of  size  10  x  10.  Kernels  are  u  vscaled  for  a  better 


visualization;  actual  pixel  size  is  displayed  above  each  kernel  subplot. 


30 

33 

33 

34 


percent 
ation.  (c) 
precision 
(M=512) 
l  (M=64) 


5.2  Robust  Coding  solutions:  computed  without  additional  constraints  (left),  with  ’’sparsity” 

constraints  (center),  and  with  “locality”  constraints  (right).  [Courtesy  of  E.  Doi.]  . 49 


5.1  Image  coding  under  the  presence  of  channel  noise.  For  each  reconstruction  it 
error  is  indicated,  (a)  Original  image,  (b)  PCA  (M=32)  with  noiseless  represent 
PCA  (M=32)  with  1-bit  precision  code,  (d)  Robust  coding  (M=32)  with  1-bit 
code,  (e)  Robust  coding  (M=64)  with  1-bit  precision  code,  (f)  Robust  coding 
with  1-bit  precision  code,  (g)  PCA  (M=64)  with  1-bit  precision  code,  (h)  IC/ 
with  1-bit  precision  code,  (i)  Daubechies  9/7  wavelet  with  1-bit  precision  code. 
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Chapter  1 
Introduction 


In  everyday  life,  a  change  of  perspective  about  a  particular  problem  that  we  confront  is  likely 
to  reveal  entirely  new  aspects  which  both  enrich  our  understanding  of  the  problem  and  improve 
the  way  we  solve  it.  Sometimes,  a  different  perspective  might  be  even  critical  in  discovering  the 
most  efficient  solution.  This  high-level  principle  lies  at  the  foundation  of  science  and  discovery 
in  general,  but  in  signal  processing  it  has  a  concrete,  low-level  analog:  the  choice  of  signal 
representation. 


1.1  Motivation 

Deriving  efficient  representations  of  natural  signals  is  an  important  and  challenging  research 
topic.  There  are  multiple  ways  to  quantify  progress,  many  of  them  employed  as  the  “working 
standard”  by  an  entire  beneficiary  community.  For  instance,  to  multimedia  users,  having  better 
signal  representations  means  more  music  and  video  on  their  portable  devices,  and  consequently 
more  entertainment.  To  robot  manufacturers,  it  translates  into  a  better  chance  for  a  robot  to 
navigate  and  operate  within  a  new  environment.  Diverse  applications  in  communications,  earth 
sciences,  and  medicine  can  greatly  benefit  from  representations  with  good  descriptive  and  com¬ 
putational  properties. 

Besides  the  attraction  towards  practical  applications,  often  associated  with  financial  gratifi¬ 
cation,  there  exists  the  human  drive  to  understand  nature.  One  of  the  greatest  challenges  posed  to 
science  has  been  to  explain  the  function  of  the  brain:  what  are  the  principles  that  govern  the  phe¬ 
nomena  taking  place  here?  There  is  still  much  to  be  learned  about  representing  and  processing 
information  in  the  brain,  even  in  relatively  specialized  subsystems.  For  instance,  by  observation 
and  experimentation  we  learned  that  the  visual  system  has  the  role  of  analyzing  and  combining 
the  various  information  content  of  the  perceived  images,  as  it  is  transmitted  from  the  retina  all 
the  way  to  the  cortex.  This  is  much  more  than  to  merely  format  the  visual  stimulus  into  spikes 
and  distribute  it  to  the  processing  areas;  it  also  involves  “splitting”  complex  scenes  into  features 
that  later  combine  into  higher-level  concepts. 

Although  the  exact  mechanisms  and  computational  principles  driving  these  processes  are  not 
fully  understood  (or  maybe  because  of  it),  the  brain  is  rightfully  considered  the  ultimate  high- 
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end  signal  processor.  It  outperforms  by  far  any  known  artificial  system  at  tasks  that  involve 
abstract  concept  manipulation  -  analyzing  complex  scenes,  navigating  unknown  environments, 
extracting  specialized  types  of  features  like  text,  etc.  It  is  then  perhaps  not  surprising  that  several 
security  protocols  are  heavily  relying  on  this  (see,  for  instance  [|1 1 8|] ) ! 

In  this  thesis,  we  investigate  several  aspects  related  to  natural  signal  representation  while  fo¬ 
cusing  on  one  main  class  of  applications:  visual  signal  encoding.  We  address  several  problems 
generated  by  existing  signal  representation  frameworks  and  propose  novel  extensions  by  em¬ 
bracing  the  adaptive  encoding  point  of  view.  For  each  of  the  instances  hereby  studied,  we  follow 
two  main  goals:  to  clearly  identify  the  theoretical  principles  which  govern  the  representation’s 
optimality,  and  to  identify  the  most  efficient  algorithm  to  compute  it. 

1.2  Thesis  outline 

The  thesis  is  organized  as  follows. 

•  Background.  In  Chapter  |2j,  we  review  most  of  the  fundamental  notions  and  theoretical 
concepts  used  in  the  remainder  of  the  thesis.  We  start  by  introducing  basic  signal  pro¬ 
cessing  notions  such  as  bases,  frames,  and  dictionaries,  then  continue  by  presenting  the 
concept  of  multiresolution  and  several  signal  representations  of  that  family.  Next,  we  ad¬ 
dress  the  issue  of  adaptivity  and  illustrate  it  with  two  finite-dimensional  linear  models: 
ICA  and  Robust  Coding.  After  a  short  introduction  to  sparse  signal  approximations  with  a 
particular  emphasis  on  greedy  encoding  methods,  we  explain  the  concept  of  shiftable  ker¬ 
nel  dictionary  representations,  and  illustrate  a  way  to  obtain  adaptive  dictionaries  of  this 
type. 

•  Multiresolution  ICA.[]  In  Chapter  |3|,  we  study  the  problem  of  efficient  and  adaptive  repre¬ 
sentation  of  large-scale  images.  In  recent  decades,  the  multiresolution  paradigm  has  pro¬ 
vided  powerful  mathematical  and  algorithmic  tools  to  signal  encoding.  In  spite  of  widely 
proven  effectiveness,  such  methods  ignore  the  statistical  structure  of  the  class  of  signals 
they  represent.  On  the  other  hand,  because  of  usually  high  computational  costs,  standard 
linear  adaptive  statistical  models  have  been  confined  artificially  to  relatively  small  block- 
based  encoding  scenarios.  We  show  that  a  good  tradeoff  between  computational  cost  and 
coding  efficiency  can  be  achieved  via  a  hybrid  encoding  scheme:  Multiresolution  ICA. 
When  applied  to  natural  images  the  new  method  significantly  outperforms  JPEG2000,  the 
current  compression  standard,  which  indicates  adaptivity  as  a  source  of  practical  improve¬ 
ment  for  modern  coders. 

•  Point  Coding^  In  Chapter^,  we  review  the  problem  of  sparsely  encoding  large  signals  via 
a  set  of  adaptive  variable-size  shiftable  kernels.  An  important  merit  of  this  approach  is  that 
it  produces  a  very  efficient  adaptive  code,  which  is  explained  by  the  flexibility  of  such  a 

’Parts  of  this  chapter  have  been  published  in  |[TT|] . 

2The  work  in  this  chapter  has  been  published  in  |[TJ|. 
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dictionary,  compared  with  conventional  representations,  such  as  wavelets.  In  addition,  this 
provides  a  more  direct  way  to  obtain  an  (approximately)  shift-invariant  signal  representa¬ 
tion.  This  is  especially  desirable  in  modeling  encoding  systems  robust  to  signal  shifts,  such 
as  biological  sensory  systems.  We  study  this  problem  in  the  case  of  images  and  provide 
contributions  leading  to  efficient  algorithms,  significantly  improving  the  complexity  of  the 
kernel  learning  process.  Specifically,  we  show  that  we  can  employ  fast  and  superfast  algo¬ 
rithms  for  the  learning  step,  by  formulating  the  problem  as  a  least-squares  problem  with  a 
highly  structured  (Toeplitz-mosaic)  matrix.  Encoding  is  implemented  via  a  fast  version  of 
Matching  Pursuit  based  on  exploiting  the  relatively  small  sizes  of  the  kernels  compared  to 
the  signal  and  by  using  appropriately  designed  data  structures  to  speed  up  computations. 

•  Robust  Coding.f]  In  Chapter  |5]  we  provide  a  detailed  mathematical  study  of  Robust  Coding 
-  the  problem  of  optimal  linear  coding  with  limited  precision  units.  We  characterize  opti¬ 
mal  solutions  in  the  case  of  Gaussian  channel  noise  and  arbitrarily  many  encoding  units, 
and  derive  efficient  and  stable  algorithms  for  their  computation.  By  conveniently  express¬ 
ing  the  limit  of  optimization  as  the  closed-form  bound,  we  formally  explain  the  intuition 
that  noisy  encoding  units  can  preserve  signal  information  if  sufficiently  many  are  used  -  a 
case  very  relevant  to  modeling  neural  encoding  systems. 

•  Conclusions.  Chapter  [|  contains  a  summary  of  this  thesis,  together  with  a  list  of  directions 
and  ideas  we  intend  to  pursue  in  the  future. 

In  completing  the  research  projects  reported  here,  we  employed  the  principles  of  learning  and 
optimization  to  the  design  of  signal  representations.  In  each  case,  the  first  step  was  to  identify 
some  clear  computational  objective  and  express  it  in  the  most  convenient  mathematical  form. 
The  next  step  was  to  choose  the  algorithm  most  suitable  to  exploit  the  intrinsic  structure  of  the 
problem.  Finally,  by  implementation  and  simulation,  hypotheses  were  validated  and  often  new 
insights  appeared. 

For  the  sake  of  coherence,  we  decided  not  to  include  in  the  body  of  this  thesis  several  pub¬ 
lished  results,  which  are  nevertheless  useful  in  exploring  and  understanding  various  aspects  of 
signal  representation  design.  One  significant  direction  is  artificial  bandwidth  extension  of  speech 
signals  (see  [|9],  |103|]).  There  we  consider  the  practical  problem  of  enhancing  speech  whose  TF 
content  is  missing,  either  as  a  result  of  bandpass  filtering  (like  in  telephony)  or  as  a  by-product 
of  certain  source  separation  algorithms.  We  address  the  problem  of  “filling  the  spectral  holes”  a 
case  of  statistical  estimation  with  missing  data.  A  second  research  direction  we  mention  is  the 
problem  of  designing  polynomial  signal  transforms  asymptotically  approximating  the  Discrete- 
Time  Fourier  Transform,  but  which  do  not  require  the  periodicity  assumption  usually  associated 
with  DFT  (see  [p~Q|]).  To  answer  this  question,  arising  from  the  general  algebraic  signal  process¬ 
ing  theory  [p9]|.  we  identify  a  fairly  large  class  of  such  finite  polynomial  transforms  by  defining 
polynomial  families  whose  set  of  roots  approximately  converges  to  the  complex  unit  circle  (with 
perhaps  finitely  many  exceptions). 

3  Parts  of  this  chapter  have  been  published  in  [[bj.  [44]] . 
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Chapter  2 
Background 


The  focus  of  this  thesis  is  the  study  of  signal  representations.  We  investigate  several  ways  in 
which  existing  approaches  can  be  extended  and  improved,  but  first  we  need  to  lay  out  the  math¬ 
ematical  groundwork  of  our  construction. 

To  start,  let  us  specify  that  we  only  consider  discrete-time/space  signals.  Although  several  of 
the  problems  we  treat  in  the  following  do  not  necessarily  require  it,  for  simplicity  we  restrict  to 
finite  signals.  Accordingly,  a  signal  shall  usually  be  regarded  as  a  vector  x  in  a  finite-dimensional 
space  V  ( e.g .,  K'V  or  C/Y).  Its  representation  shall  be  defined  with  respect  to  dictionary i  <I>,  a  finite 
subset  of  V,  as  a  vector  s  such  that 

x  =  <3>s.  (2.1) 

Assuming  the  dictionary  has  M  elements,  the  representation  is  thus  a  M-dimensional  vector  of 
coefficients.  Computing  the  signal  from  the  representation  (when  the  dictionary  is  fixed)  is  called 
decoding  or  synthesis,  and  by  definition  is  a  simple  linear  operation.  The  reverse  operation 
(< encoding ,  or  analysis )  may  be  a  more  complicated  process  is;  for  example,  there  may  exist 
infinitely  many  vectors  s  (or  even  none  at  all!)  that  satisfy  eq.  pH]. 

A  central  issue  of  this  thesis  is  adaptivity.  When  the  dictionary  is  optimized  in  some  respect 
to  represent  signals  of  a  given  class,  i.e.,  it  reflects  either  deterministic  or  statistical  properties  of 
the  class,  we  say  it  is  adapted  to  the  class  and  the  induced  representation  shall  be  called  adaptive', 
otherwise,  we  label  it  as  fixed. 


2.1  Bases,  Frames,  and  Dictionaries 

The  canonical,  sample-based  representation  is  often  not  appropriate  to  describe  compactly  the 
complicated  structure  of  many  classes  of  signals.  The  most  common  source  of  redundancy  is 
the  strong  local  dependency  between  samples;  in  the  case  of  images,  this  would  correspond 
to  neighboring  pixels  having  similar  values,  due  to  similar  color  or  intensity.  Other  types  of 
regularities,  such  as  texture  patterns,  and  even  “frequent”  irregularities  or  other  discontinuities, 
like  edges,  could  be  handled  more  efficiently  by  alternative  representations. 

The  idea  of  choosing  the  appropriate  signal  representation  for  the  task  at  hand  is  generating 


5 


much  of  the  effort  (and  the  progress)  in  signal  processing  and  many  results  from  linear  algebra 
have  contributed  to  great  advancements  in  the  field.  Essentially,  searching  for  linear  representa¬ 
tions  of  finite  signals  is  searching  for  dictionary  matrices  $  whose  columns  span  the  entire  vector 
space  V.  If  the  columns  of  such  a  matrix  form  a  spanning  set  for  the  whole  space  and  they  are 
linearly  independent,  we  say  that  the  representation  is  complete]].  In  case  the  vectors  lack  linear 
independence  but  still  span  the  space,  the  representation  is  called  overcompleteQ.  Otherwise,  we 
shall  denote  the  representation  as  undcrcompletcf]. 

The  ideas  about  changing  coordinates  to  analyze  structure  go  a  long  way  back,  originating 
in  Physics,  but  spectral  methods  and  their  applications  to  signal  processing  have  flourished  after 
the  (re)discovery  of  the  fast  algorithms  for  their  implementation.  Unfortunately,  the  Discrete 
Fourier  Transform,  the  many  kinds  of  Discrete  Trigonometric  Transforms,  or  any  linear  basis 
of  C;V  with  only  global  structure  for  that  matter,  do  not  offer  a  compact  description  of  signals 
with  local  spatial  structure,  which  is  a  consequence  of  the  uncertainty  principle.  The  need  for 
more  specialized  descriptors  was  behind  the  multiscale  revolution,  and  the  wavelet  frenzy.  The 
representational  advantage  and  the  low  computational  cost  of  applying  the  Discrete  Wavelet 
Transform  have  lead  to  the  design  of  current  image  coding  standards  such  as  JPEG2000  [|109H 
(also  see  ®[H7|]). 

The  various  applications  often  require  representations  having  specific  analytical  and  compu¬ 
tational  properties.  Generally,  the  process  of  computing  such  bases  requires  an  objective  function 
and  a  set  of  constraints.  In  come  cases,  the  solutions  of  these  optimization  problems  are  unique, 
while  in  others  they  are  not;  moreover,  it  is  also  possible  that  the  optimization  problem  be  over¬ 
constrained  and  that  no  solution  exists.  For  example,  in  frame  design  the  objective  is  to  find  an 
overcomplete  basis  with  specified  properties,  such  as  prescribed  length  columns  and  minimum 
inner  product  between  different  column  vectors.  If  we  throw  in  additional  constraints,  for  exam¬ 
ple  those  regarding  the  set  of  singular  values  of  the  matrix,  we  may  get  a  rather  difficult  problem 
to  solve.  Luckily,  some  of  these  problems  can  be  handled  numerically  by  the  use  of  efficient  op¬ 
timization  algorithms  on  structured  spaces,  as  opposed  to  having  closed-form,  analytic  solutions. 
Depending  on  the  context,  we  will  be  satisfied  with  such  a  particular  numerical  solutions,  or  we 
shall  search  for  specifications  of  the  whole  space  of  solutions.  In  this  thesis,  we  will  pursue  both 
possibilities,  and  specify  the  advantages  for  adopting  each  point  of  view. 

In  certain  cases,  it  is  possible  to  implement  the  process  of  signal  analysis  or  that  of  synthesis 
of  a  without  the  explicit  use  of  a  matrix-vector  product.  This  happens  frequently  when  faster 
algorithms  exist  which  do  not  require  all  the  entries  of  the  basis  matrix.  Just  to  give  an  example, 
computing  the  Discrete  Fourier  Transform  of  a  signal  (either  the  direct  or  the  inverse  one)  is 
nowadays  almost  synonymous  with  using  the  Fast  Fourier  Transform  (FFT)  algorithm  (see  e.g., 
[|56|]).  Other  widely-known  signal  processing  operations  (e.g.,  convolution)  can  benefit  from  this 
aspect.  We  will  point  out  the  distinction  between  explicit  and  implicit  linear  operations  when  we 


1  We  call  the  set  of  vectors,  and  by  extension  the  matrix  itself  -  a  basis.  For  finite  dimensional  vector  spaces, 
basis  matrices  are  always  square. 

2To  describe  such  a  system  of  vectors,  we  will  use  the  term  “frame”.  The  formal  definition  of  a  frame  (see  e.g., 
[imp  is  equivalent  to  this  one  in  the  case  of  finite  sets  of  vectors  spanning  a  finite  dimensional  space. 

3We  will  slightly  abuse  the  term  “basis”  in  this  case,  using  it  even  if  the  vector  set  3?  does  not  span  the  space  V. 
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discuss  implementation  aspects,  otherwise  we  will  keep  this  issue  transparent. 

In  the  following,  we  gradually  introduce  several  strategies  for  obtaining  desirable  represen¬ 
tations  as  well  as  for  computing  the  corresponding  signal  coefficients.  Our  generic  goal  will  be 
efficient  coding,  which  also  can  be  formulated  as  compact  signal  description.  Usually,  this  in¬ 
volves  computing  a  minimal  set  of  nonzero  coefficients;  the  advantage  of  these  so-called  sparse 
representations  is  that  we  only  need  to  store  or  compress  a  small  set  of  numbers,  as  opposed  to 
the  whole  set  of  samples,  for  (approximately)  reconstructing  the  signal.  Then,  we  review  adap¬ 
tive  linear  methods,  whose  goal  is  to  improve  the  descriptive  properties  of  the  dictionary  with 
respect  to  the  observed  data.  Next,  we  describe  the  problem  of  deriving  adaptive  linear  represen¬ 
tations  that  are  robust  to  coefficient  perturbations.  Finally,  we  briefly  review  the  so-called  Spike 
Coding  model  -  a  method  for  sparse  one-dimensional  signal  representation  using  an  adaptive 
shiftable-kernels  dictionary. 


2.2  Adaptive  Models:  ICA  and  Robust  Coding. 


Independent  component  analysis  (ICA)  has  appeared  in  the  signal  processing  community  as  a 
general  method  to  separate  a  number  of  sources,  assumed  mutually  independent,  when  several 
(linear)  combinations  of  these  are  available  [|68|].  The  particular  case  when  the  sources  are  Gaus¬ 
sian  distributed  had  already  been  known  as  principal  component  analysis  (PC A),  but  this  could 
not  handle  and  explain  the  many  examples  of  signal  distributions  that  are  not  proper  Gaussian. 
In  a  relatively  short  time,  the  field  also  was  extended  theoretically  and  equivalence  between  the 
apparently  different  settings  has  been  revealed  (see  @0;  extensive  treatment  of  ICA  also  can 
be  found  in  [j3Q|,  |64|.  [79]]). 

Let  xi,  x2,  •  •  • ,  xm  be  samples  drawn  from  a  distribution  with  pdf  p  over  R'v.  The  goal  of 
ICA  is  to  compute  the  NxN  matrices  W  such  that  vectors  s j  =  Wx,  are  the  realizations  of 
a  random  vector  whose  components  are  as  statistically  independent  as  possible,  according  to 
formal  criteria  described  below.  In  other  words,  we  search  for  the  linear  mapping  allowing  us 
to  best  approximate  the  data  distribution  by  a  product  of  marginals.  Alternatively,  ICA  can  be 
viewed  as  a  method  to  describe  the  data  by  a  linear  combination  of  vectors  (or  “basis  functions”): 


x  =  As 


(2.2) 


such  that  the  components  of  random  vector  s  are  maximally  independent.  Matrix  A  is  called 
the  mixing  matrix,  while  W  is  referred  as  the  demixing  matrix.  A  standard  assumption  is  that 
the  data  has  been  processed  to  have  zero  mean,  and  unit  covariance  (possibly  by  dimensionality 
reduction),  which  implies  that  matrices  A  and  W  are  nonsingular,  and  inverse  to  each  other. 
When  the  underlying  data  distribution  is  Gaussian,  coefficient  independence  is  equivalent  to 
decorrelation,  and  thus,  when  the  coefficients  are  indeed  independent,  the  ICA  and  PCA  bases 
will  coincide.  In  general,  PCA  will  search  for  the  best  orthogonal  basis  to  represent  the  data, 
while  ICA  does  not  have  this  constraint. 

There  are  several  mathematical  objective  functions  associated  with  independence,  and  each 
leads  to  one  formulation  of  the  ICA  problem.  For  example,  one  such  objective  is  to  minimize 
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the  mutual  information  of  the  representation  components.  Specifically,  for  any  invertible  linear 
transformation  s  =  Wx,  the  mutual  information  among  the  components  of  s  is  defined  as: 


I(s1, 


0  =  ^tf(Si)-tf(x)-log|detW| 


(2.3) 


i— 1 


By  an  appropriate  scaling  of  W,  mutual  information  can  be  viewed  as  the  difference  between  the 
sum  of  the  marginal  entropies  and  the  joint  entropy.  This  can  be  interpreted  further  in  terms  of 
the  Kullback-Leibler  divergence  between  the  joint  probability  and  the  product  of  marginals,  or  in 
terms  of  the  negentropy  of  the  projections,  or  in  terms  of  data  likelihood  [|3T],  [64]] .  For  instance, 
if  we  denote  the  pdf’s  of  the  projections  by  Pi(-),  the  expectation  of  the  log-likelihood  can  be 
written  asj] 

— 6{log  LCW)}  =  6{log  pj(wj-x)}  +  log  I  det  W|  (2.4) 

m  t--* 

i= i 

which  is  the  negative  of  mutual  information,  except  for  a  constant  term  (entropy  of  the  data). 
Equivalently,  this  can  be  viewed  as  maximizing  various  measures  of  non-Gaussianity  (e.g.,  the 
kurtosis)  of  the  entire  ensemble  and  many  practical  ICA  algorithms  are  based  on  methods  that 
attempt  to  maximize  higher  order  moments  of  the  coefficients.  Finally,  another  direction  to 
approach  independence  of  the  coefficients  is  by  diagonalization  of  certain  matrix  functionals. 
Methods  of  this  family  thus  translate  independence  into  simultaneously  solving  a  series  of  eigen- 
problems. 

Generally,  existing  ICA  optimization  algorithms  can  be  grouped  into  several  categories: 
gradient-based  [|2],  [D|,  |23|,  [6f].  [SI]],  hxed-point[]  [^3],  |1 19Q,  joint-diagonalization  |[24[.  [25||. 

A  very  efficient  procedure  based  on  Relative  Trust-Region  Optimization  has  been  presented  re¬ 


cently  in  Q29Q;  due  to  its  excellent  behavior,  we  decided  to  employ  this  algorithm  for  all  the 
ICA-related  experiments  presented  in  this  thesis. 

To  represent  images  in  the  ICA  model,  we  can  regard  them  as  samples  drawn  from  an  un¬ 
known  distribution,  over  a  linear  space  (the  so-called  image  space).  The  fact  that  independent 
components  of  natural  images  have  a  very  sparse  (or  “thorny”)  distribution  makes  ICA  highly 
suitable  for  image  representation  and  coding.  The  immediate  consequence  of  a  sparse  marginal 
distribution  is  a  low  entropy,  and  thus  the  possibility  of  achieving  a  short  average  code  length, 
which  ultimately  yields  better  compression.  In  fact,  among  all  linear  models  the  ICA  represen¬ 
tation  is  optimal  in  the  entropy  minimization  sense,  which  thus  recommends  it  for  compression 
tasks. 

A  well  known  limitation  of  ICA  is  its  poor  scaling  behavior.  Due  to  the  relatively  high  com¬ 
putational  cost,  it  cannot  be  applied  directly  to  large  dimensional  data.  For  example,  in  case 
of  dxd  images  computing  a  complete  basis  to  span  the  space  implies  estimating  d4  parameters. 
Even  for  a  moderate  value  of  d  (say  100)  the  memory  requirements  are  tremendous.  More¬ 
over,  the  computational  cost  of  typical  gradient  optimization  is  Q(<f21og2 '),  which  is  prohibitive. 


4Here  w,  is  the  uh  row  vector  of  matrix  W. 

2  Although  it  has  been  proven  that  some  of  these  algorithms  were  wrongly  classified  as  "fixed-point”  [  1 19],  for 
convention  purposes  we  chose  to  leave  them  in  this  category. 
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Figure  2.1:  Diagram  of  the  Robust  Coding  Model. 


Therefore,  this  approach  is  feasible  only  if  the  problem  is  reduced  to  a  smaller  space  (as  in  PCA), 
which  in  turn  limits  the  signal  structure  we  are  able  to  represent.  From  the  image  coding  point 
of  view,  the  quantization  or  sparsification  of  the  coefficients  in  this  block  transform  approach 
leads  to  blocking  artifacts  in  the  reconstruction.  Block  processing  also  leads  to  artifacts  for  more 
general  signal  processing  algorithms,  such  as  image  denoising.  In  chapter  [3],  we  shall  present 
a  method  to  compute  an  quasi-ICA  basis  for  large  images  by  solving  several  (typically  much 
smaller)  ICA  problems. 


Optimizing  for  coding  efficiency  is  a  very  desirable  goal  in  applications  such  as  storage  and 
communication.  An  equally  important  aspect  is  the  resilience  of  the  signal  representation  to 
various  types  of  noise.  Reliable  communication  over  noisy  channels  is  the  most  fundamental 
problem  of  information  theory.  Out  of  the  many  variations  on  this  theme,  we  shall  focus  on  the 
problem  of  finding  finite-dimensional  linear  representations  that  optimally  preserve  information 
in  the  transmitted  signals  when  the  representation  has  limited  precision.  Proposed  by  E.  Doi 
et.al.  in  |[43l|  and  further  analyzed  in  [[44|| .  the  Robust  Coding  scheme  uses  arbitrarily  many 
coding  units  to  minimize  reconstruction  error,  by  explicitly  introducing  redundancy  in  the  code 
to  compensate  for  channel  noise. 

The  above  problem  was  pointed  out  to  be  of  particular  relevance  to  the  mathematical  mod¬ 
eling  of  neural  representations.  This  is  not  at  all  surprising;  cells  can  be  regarded  as  communi¬ 
cation  channels  for  the  traveling  neural  spikes,  and  their  coding  precision  is  limited  by  intrinsic 
biological  constraints  to  as  low  as  1-2  bits  per  spike  (see  m,m  and  references  therein).  By 
identifying  the  short  time  activity  of  a  neuron  with  a  real  value,  the  limited  information  capacity 
of  the  encoding  unit  can  be  modeled  effectively  by  additive  Gaussian  noise. 

To  describe  the  problem  formally,  let  us  consider  our  signals  as  samples  drawn  from  an  N- 
dimensional  zero-mean  data  distribution,  with  known  full-rank  covariance  matrix  £x.  We  shall 
search  for  analysis  matrix  W  e  MMx  ;V  and  synthesis  matrix  A  e  K/VxM  that  maximally  reduce 
the  effect  of  additive  Gaussian  noise,  independent  of  the  signal  and  having  the  same  power  erf  on 
each  channel.  If  we  denote  e  =  x  —  x  =  (Ijv  —  AW)  x  —  AS,  our  objective  is  to  minimize  the 
reconstruction  MSE 


< I lel li)x,<5  =  (eTe)  =  tr((eeT))  =  tr  { (((Ijv  —  AW)  x  —  A5)((Ijv  —  AW)  x  —  A<5)T)} 
=  tr  j((Ijv  -  AW)  xxT  (IN  -  AW )T)X  +  (AMrAT),} 

=  tr  {(Ijv  -  AW)Ex(Iat  -  AW)T}  +  aftr  {AAt} 
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In  chapter  |5j  we  shall  provide  a  more  thorough  description  of  the  problem,  as  well  as  a  com¬ 
plete  characterization  of  the  optimal  encoder/decoder  pair. 


2.3  Sparse  Approximations 


Data  compression  is  built  around  the  principle  of  employing  compact  descriptions  of  a  seem¬ 
ingly  complex  signal.  Computing  sparse  linear  representations  of  signals  is  useful  for  many 
applications,  ranging  from  data  communication  to  statistical  data  analysis  and  machine  learning. 

In  general,  it  may  not  be  possible  to  choose  a  small  set  of  nonzero  coefficients  to  represent  any 
signal  exactly  in  a  given  basis.  Besides  an  intrinsic  measure  theoretic  difficulty,  the  limitation 
can  persist  even  if  we  relax  the  exactness  and  settle  for  a  sparse  approximate  coefficient  set. 
The  properties  of  the  dictionary  can  help  significantly  if  they  match  the  statistical  properties 
of  the  signal.  We  will  address  this  issue  in  the  following  subsection,  but  now  it  is  important 
to  focus  on  two  questions.  The  first  is:  how  can  we  compute  the  sparsest  representation  in  a 
given,  fixed  dictionary?  The  second  question  is  concerned  with  the  validation  of  our  answer 
to  the  first  one:  how  close  are  we  to  the  sparsest  representation?  Unfortunately,  in  the  most 
general  case  both  problems  are  NP-hard  (see  [|90|],  as  well  as  [|37|.  [38].  [39|]),  which  means  that 
it  is  unlikely  to  compute  optimal  solutions  in  polynomial  time.  In  spite  of  this  shortcoming,  it 
is  possible  to  obtain  approximate  solutions  in  (pseudo-)polynomial  time.  Next,  we  describe  the 
most  frequently  used  approaches  to  obtain  such  sparse  linear  approximations. 

The  most  straightforward  idea  is  thresholding.  More  precisely,  out  of  a  complete  set  of  coef¬ 
ficients  corresponding  to  a  linear  combination  of  atoms,  we  only  keep  the  ones  with  the  largest 
absolute  values,  while  the  rest  are  assumed  to  be  zero.  Various  strategies  for  determining  the  “ap¬ 
propriate”  number  of  coefficients  were  studied,  and  the  success  of  this  approach  to  applications 
like  compression  and  denoising  was  investigated  in  the  context  of  wavelet  dictionaries. 

The  proper  greedy  approach  to  the  sparsest  set  selection  problem  has  become  known  in  signal 
processing  by  the  name  of  Matching  Pursuit  (MP)  [|87|] .  Unlike  thresholding,  which  assumed  that 
all  the  coefficients  of  a  transform  would  have  to  be  available  in  order  to  choose  the  largest  ones, 
MP  is  a  sequential  algorithm,  allowing  us  to  individually  pick  dictionary  atoms  that  are  most 
correlated  with  the  signal.  Mathematically,  if  we  consider  a  signal  x  and  a  dictionary  made  of 
atoms  (</>i)jg/,  with  /  a  given  set  of  indices,  and  define  i?°x  =  x,  then  at  iteration  step  k  >  1  the 
new  residual  Rkx  is  computed  by  the  rule 


Rk*  =  R ^x  -  sk(j)ik  (2.5) 

where  (f)ik  =  argmax  |<  i?fc_1x,  fa  >1,  and  sk  =<  i?fc_1x,  <f)ik  >.  The  atom  selected  at  each 

i£l 

step  is  orthogonal  to  the  newly  computed  residual  and  thus,  by  Pythagora’s  theorem,  the  energy 
of  the  residual  error  decreases  strictly  which  in  turn  guarantees  asymptotical  convergence.  The 
computational  cost  of  this  procedure  is  in  general  0(MN )  per  iteration,  where  N  the  dimension 
of  x  and  M  is  the  number  of  atoms  in  the  dictionary  (the  bottleneck  lies  in  updating  the  correla¬ 
tion  coefficients  at  each  step).  If  the  inner  products  of  all  pairs  of  atoms  are  available,  then  the 
cost  can  be  reduced  to  0(M).  This  could  be  useful  in  a  setting  where  MP  is  called  for  many 
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signals  x,  using  the  same  dictionary;  for  this  scenario  to  be  practical  the  number  of  calls  should 
be  at  least  M .  If  the  dictionary  is  very  big  however,  this  approach  would  fail  because  of  memory 
constraints. 

One  fundamental  criticism  of  the  Matching  Pursuit  algorithm  is  the  suboptimal  criterion  for 
choosing  each  atom.  Although  at  each  iteration  step  the  atom  most  correlated  with  the  signal 
is  chosen,  the  distance  from  the  current  residual  to  the  linear  span  of  the  atoms  chosen  so  far 
is  not  (as  we  would  expect)  given  by  the  current  coefficients,  unless  the  dictionary  is  an  or¬ 
thogonal  basis.  To  address  this  problem,  Orthogonal  Matching  Pursuit  (OMP)  [|95|]  performs 
an  additional  step  (orthogonalization)  to  insure  that  the  computed  residual  is  orthogonal  to  all 
of  the  already  selected  atoms.  One  effect  is  that  the  OMP  algorithm  is  forced  to  stop  after  a 
number  of  steps  smaller  or  equal  to  the  dimension  of  the  space,  unlike  MP  which  could  poten¬ 
tially  run  infinitely.  Nevertheless,  OMP  uses  the  same  (rather  local)  criterion  for  atom  selection. 
It  was  the  so-called  Optimized  Orthogonal  Matching  Pursuit  (OOMP)  procedure  introduced  in 


Q102|]  which  addressed  this  issue.  Namely,  if  we  denote  by  Vk  the  vector  space  spanned  by  the 


first  k  selected  atoms,  and  for  all  remaining  atoms  ay  we  define^  7 y  = 

OOMP  approach  selects  the  index  ik  =  argmax|  <  yy ,  Rk  1  x  >  |/[  [Oj  I U  1 7?  I  7^  0,  unlike  MP 


Pvkotj ,  then  the 


k- 1 


x  > 


Further  extensions  of  this  method  emerged, 
1U1|]),  Swapping-Based  OMP  (Swap-OOMP  [FI]), 


and  OMP,  which  pick  argmax  |  <7 y,  R 

j 

e.g.,  Backward-Optimized  OMP  (BOOMP 
which  are  heuristics  meant  to  improve  the  chances  of  the  greedy  algorithm  to  approach  the  op¬ 
timally  sparse  solution  (regarding  implementation  issues,  see  also  the  OOMP  tutorial  |[4]]).  The 
computational  price  paid  for  such  refinements  is  sometimes  hard  to  accept  though,  and  for  many 
situations  when  a  faster,  reasonably  sparse  approximation  is  sufficient,  MP  is  the  standard  choice. 

Finally  we  mention  another  approach  which  is  frequently  used  in  approximation  algorithms 
to  tackle  hard  combinatorial  problems,  by  reducing  them  to  the  continuous  domain:  relaxation. 


In  the  case  of  the  sparse  selection  of  dictionary  atoms,  this  was  known  as  Basis  Pursuit  (BP)  [[28]]. 
Specifically,  instead  of  minimizing  the  number  of  nonzero  coefficients  of  the  decomposition  of 
signal  x  into  dictionary  3>,  equivalent  to  solving 


mm 

S 


|  s  1 1 0  s.t.  x  =  <3>s 
we  attempt  to  solve  the  relaxed  variant  in  t\  sense: 


(Po) 


mm 

s 


S.t.  X  =  3>s 


(Pi) 


The  obvious  advantage  of  this  change  of  objective  is  that  now  we  are  dealing  with  a  convex 
problem,  which  can  be  conveniently  formulated  as  a  linear  program.  Less  obviously,  in  cer¬ 
tain  conditions  the  optimal  solutions  of  these  two  seemingly  different  problems  coincide  (see 
the  multitude  of  results  on  this  topic  contained  in  e.g.,  ][28l,  El  n,  mi).  Moreover,  numer¬ 


ical  experiments  have  also  confirmed  the  merits  of  this  approach,  displaying  its  superiority  to 
greedy  algorithms  in  applications  such  as  signal  denoising.  In  spite  of  this  fact,  BP  fails  to  offer 


6By  /yx  we  denote  the  orthogonal  projection  of  x  onto  the  linear  space  V. 
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satisfactory  computational  speed  when  compared  against  Matching  Pursuit,  even  for  relatively 
structured  dictionaries. 

To  conclude,  we  remark  that  computing  sparse  representations  is  desirable,  but  in  general  re¬ 
mains  computationally  challenging.  In  the  following,  we  will  describe  an  instance  of  a  successful 
compromise  between  sparsity  and  speed. 


2.4  Shiftable  Kernel  Representations. 


As  we  argued  previously,  the  need  for  linear  adaptive  signal  representations  is  motivated  by 
many  families  of  computational  procedures  involving  “intelligent”  ways  to  represent  data.  On 
the  other  hand  efficiency  has  been  intuitively  associated  with  sparse  representations.  For  natural 
sounds  and  images,  as  argued  by  Simoncelli  and  coll.  [|5T|],  an  additional  desirable  property  is 
shift-invariance.  Combining  these  different  requirements  has  recently  lead  to  new  sparse  signal 
representations  based  on  adaptive  shiftable-kernel  dictionaries. 

Introduced  and  further  studied  by  Smith  and  Lewicki  [|107|,  |108Q,  Spike  Coding  is  such  a 
method,  which  proved  particularly  successful  in  providing  a  biologically  plausible,  nonpara- 
metric  acoustic  model  for  the  mammal  auditory  system  (see  also  [|106|]).  The  computational 
principles  are  independent  of  the  modality  and  they  apply  for  sounds,  as  well  as  for  images  or 
video.  (For  example,  spike-based  models  were  proposed  by  Perrinet  et.al.  [p6[],  and  Rozell  et.al. 


ma¬ 

in  the  following,  we  will  give  a  basic  mathematical  description  of  the  model  (as  derived  for 

ID  signals),  and  address  several  aspects  regarding  its  implementation.  The  objective  of  Spike 
Coding  is  linear  approximation  of  a  signal  x  E  with  a  set  of  K  shiftable  kernels  <3>  = 
{4>k}l<k<K,  such  that  the  representation  is  as  sparse  as  possible.  Thus,  the  optimization  problem 
can  be  expressed  as 


mm  s  q 


S.t.  ||x(~)  -  y%fc,t0fc,t(-)||2  <  6. 

k,t 


As  this  problem  is  NP-hard  (see  section  4.2),  we  can  only  hope  to  approximate  its  solution. 


One  approach  which  is  computationally  very  attractive,  is  Matching  Pursuit  Q87Q.  Atoms  in  the 
signal  representation  are  obtained  by  selecting  the  shifted  version  of  a  kernel  most  correlated 
with  the  signal,  thus  picking  out  the  single  most  informative  feature  available  at  each  step.  More 
precisely,  if  R°x  =  x,  the  residual  signal  corresponding  to  the  n-th  iteration  is  computed  as: 


/Tx  =  /T'-'x 


(2.6) 


where  4>kn,pn  =  argmax  |<  Rn  :x,  (j)ktP  >|  ,  and  sn  =<  Rn  xx,  4>kr\pn  >■ 

k,p 

It  is  worth  pointing  out  that  this  encoding  method  is  close  to  being  shift  invariant.  By  trans¬ 
lating  the  whole  signal  by  a  sample  in  any  direction,  most  of  its  correlation  with  the  kernels  will 
suffer  an  appropriate  change  in  position,  but  not  in  value.  In  fact  the  only  places  where  this  does 
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not  hold  true  is  at  the  boundaries,  and  their  number  is  less  and  less  significant  as  the  size  of  the 
signal  gets  larger.  Due  to  the  particular  form  of  the  dictionary  we  are  effectively  combining  a 
convolution-based  method  with  the  greedy  procedure  of  picking  a  very  sparse  coefficient  set. 
Efficient  implementations  of  MP  in  the  one-dimensional  case  have  been  proposed  independently 
by  Sallee  Q105D,  and  Gribonval  et  al.  [[74].  [75]],  which  reduce  the  computational  complexity  of  an 
MP  iteration  to  0 ( K  log  L),  instead  of  ()(KL). 

It  is  possible  to  reduce  the  size  of  the  representation  (in  (0  sense)  by  adapting  the  kernels  to 
the  class  of  signals  they  should  best  represent,  and  thus  by  increasing  their  descriptive  power.  In 
the  following  we  will  regard  the  optimal  dictionary  for  a  given  set  of  points,  in  terms  of  searching 
for  the  mode  of  the  posterior  distribution  in  a  similar  fashion  to  |[JT)  |1()7|] : 

p(x|d>)  =  /  p(x|<3>,  s)p(s)ds  (2.7) 


where  the  integration  is  made  by  marginalizing  over  all  possible  point  sets.  If  the  integral  above 
is  approximated  by  p(x|<l>,  s')p(s'),  where  s'  is  the  set  of  coefficients  produced  by  Matching 
Pursuit,  and  assume  that  e,  the  representation  noise,  is  distributed  according  to  A/"(0,  ae I),  then 
for  every  kernel  (j)k  we  can  find: 


d 

d^>k 


log(p(x|$)) 


d 


94>k 
-1  d 
2ae  dcj)k 

nk 


{log(p(x|$,  s'))  +  log(p(s'))} 


K  nj 


EE4 

3= 1  P=  1 


,p  'rjv  II 2 


o( 


XXp'  tx-x(s,,$)] 


k,p 


p= 1 


(2.8) 

(2.9) 

(2.10) 


where  the  expression  [x  —  x(s,  <I>  )]/,  p  denotes  the  restriction  of  the  error  signal  to  the  support 
of  (j)ktP.  Thus,  we  have  a  learning  rule  for  the  MAP  dictionary.  On  the  other  hand,  maximizing 
the  posterior  with  respect  to  is  equivalent  to  minimizing  the  (squared)  reconstruction  error, 
which  is  simply  a  quadratic  form  of  the  ensemble  f ,  the  concatenation  of  all  the  kernels  in  the 
dictionary.  We  shall  further  develop  this  issue  in  chapter 
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Chapter  3 

Multiresolution  ICA 


3.1  Introduction 


The  problem  of  efficiently  describing  visual  structure  has  been  of  great  importance  to  many 
research  fields  spanning  from  biology  to  engineering  (see  e.g.,  [0[TT7]]).  Best  existing  coders 
(notably  JPEG2000  [|1()9|])  rely  on  the  flexibility  of  multiresolution  (MR)  transforms  to  capture 


structure  in  natural  images  by  exploiting  their  intrinsic  multiscale  character  [|54|,  |85p .  The  most 
important  and  practical  analysis  tool  employed  to  access  this  structure  was  the  Discrete  Wavelet 
Transform  (DWT)  @,  |8(J  [Til 

In  spite  of  their  success,  wavelets  do  have  well-known  limitations  in  terms  of  modeling  or 
detecting  two-dimensional,  sharp,  arbitrarily-oriented  (ridge-like)  discontinuities.  Various  types 
of  MR  representations  emerged  in  the  past  decade  in  computational  harmonic  analysis,  which 
provably  outperform  wavelets  in  approximating  particular  classes  of  signals  [|22|,  pf2|] ).  Because 
of  their  great  diversity,  it  is  not  clear  what  makes  an  optimally  efficient  code  for  images.  Com¬ 
mon  intuition  that  optimal  image  features  are  smooth  surfaces  and  short  straight  edges  may  be 
accurate  for  some  classes  (e.g.,  natural  scenes  ||54]]),  but  not  for  others  (faces,  textures,  cartoons, 
fingerprints,  and  medical  images  of  all  sorts). 

Separating  signal  content  into  different  subbands,  and  concentrating  the  relevant  information 
into  a  small  set  of  non-zero  coefficients,  seems  a  natural  recipe  for  achieving  efficiency.  How¬ 
ever,  a  representation  is  inherently  suboptimal  unless  it  can  capture  the  probability  density  of 
the  data,  according  to  Shannon’s  source  coding  theorem.  As  such,  optimal  efficiency  can  only 
be  achieved  by  adapting  the  representation  to  the  statistical  structure  of  the  target  image  class. 
When  searching  for  the  “most  compact”  code,  one  method  to  employ  is  independent  component 
analysis  (ICA)  |[64|l .  Generally  speaking,  the  goal  of  ICA  is  to  derive  a  data  dependent  linear 
mapping  such  that  the  coefficients  in  this  new  representation  are  maximally  independent.  There¬ 
fore,  a  suitable  mathematical  cost  to  minimize  is  the  mutual  information  among  coefficients. 
Due  to  its  poor  computational  scalability  with  respect  to  data  dimensionality,  ICA  has  been  tra¬ 
ditionally  applied  to  images  (either  for  analysis,  encoding,  or  denoising)  by  extracting  relatively 
small  image  patches  to  be  used  as  training  samples,  followed  by  block-transforming  the  image. 
Unfortunately,  the  arbitrary  alignment  of  the  blocks  with  the  image  and  the  insufficient  capacity 
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Figure  3.1:  Multiresolution  ICA  flowchart. 


to  represent  image  structure  spread  across  blocks  produce  artifacts  at  reconstruction. 


In  this  chapter,  we  propose  an  ICA-like  image  representation,  which  overcomes  the  artifi¬ 
cial  block  confinement  and  computational  obstacles.  Our  method  consists  of  a  preliminary  MR 
( e.g .  wavelet)  decomposition  step,  followed  by  learning  an  ICA  basis  for  each  of  the  resulting 
subbands.  The  purpose  of  the  MR  step  is  to  allow  easier  access  to  structure  at  each  scale,  while 
reducing  the  bulk  of  image  information  to  the  coarsest  scale;  indirectly,  this  helps  in  eliminating 
blocking  artifacts.  Since  the  learned  ICA  bases  provide  the  most  compact  linear  code  for  each 
subband,  we  can  conclude  that  this  hybrid  Multiresolution-ICA  procedure  (henceforth  referred  as 
MrICA)  gives  an  improvement  over  both  types  of  representations.  For  a  flowchart  of  the  MrICA 
procedure  described  above,  see  fig.  |3TT|. 


Efficient  coding  has  been  a  very  suitable  paradigm  in  attempting  to  explain  how  biological 
systems  cope  with  processing  complex  information.  The  resemblance  of  the  optimally  derived 
linear  features  learned  from  natural  scenes  to  the  receptive  fields  of  simple  cells  in  primary  visual 
cortex  (VI)  has  led  to  very  interesting  hypotheses  about  the  role  and  function  of  the  brain’s  sen¬ 
sory  systems  ||T4].  p2[  |1 15Q.  A  probabilistic  modeling  approach  aimed  directly  at  optimal  efficient 
coding  of  natural  images  |[80||  has  revealed  that  the  average  entropy  improvements  of  adaptive 
linear  representations  over  fixed  ones  (Fourier,  DCT,  wavelets,  Gabor  functions  |[35[.  [36]])  are  too 
important  to  neglect.  However,  due  to  the  computational  constraints,  their  representation  was  de¬ 
rived  for  relatively  small  image  patches  and  thus  a  comparison  to  multiscale  bases  was  limited. 
We  can  mention  here  two  other  block-based  ICA  approaches  to  image  compression  [|53|,  |89|]. 
Both  compare  favorably  to  JPEG  (for  faces  and  natural  images),  and  the  first  one  even  outper¬ 
forms  the  FBI  Wavelet  Scalar  Quantizer  (WSQ)  coder  [|T7j.  [18|]  for  fingerprint  images  at  low 
rates;  however,  they  do  not  fully  exploit  the  potential  of  multiresolution.  Modeling  subband  in¬ 
formation  statistically  for  image  coding  has  been  performed  in  [|T9[] ;  the  resulting  coder  (EPWIC) 
explicitly  exploits  statistical  relationships  between  coefficients  in  different  wavelet  subbands  via 
a  parameterized  model.  Another  parametric  adaptive  multiscale  method  has  been  presented  in 


16 


[93Q;  there  the  objective  is  to  adapt  the  parameters  of  a  certain  wavelet-based  transform  to  better 
fit  natural  images.  In  contrast  to  their  approach,  MrICA  derives  an  adaptive  non-parametric  mul¬ 
tiscale  image  representation  by  letting  the  subband  ICA  basis  functions  be  learned  from  scratch, 
therefore  keeping  them  unconstrained.  Finally,  a  different  multiscale  framework  for  blind  sepa¬ 
ration  is  presented  in  [|7T|],  and  [|1 2Q|] .  The  essential  difference  between  their  work  and  the  MrICA 
lies  in  the  nature  of  the  mixtures:  we  regard  the  images  as  being  sample  points/vectors  drawn 
independently  from  a  certain  distribution,  while  in  their  case  the  images  are  the  mixtures  and  the 
samples  correspond  to  sets  of  pixels  drawn  from  all  images,  at  identical  spatial  locations. 


Structure  of  the  chapter.  In  Section  (3.2|.  we  describe  the  main  components  of  our  image  en¬ 
coder.  We  address  the  issues  of  computational  complexity  of  MrICA  in  section  |3.3|.  Section  [3~4| 
describes  in  detail  the  experimental  results  illustrating  the  encoding  performance  of  the  proposed 
adaptive  method,  while  section  [33]  concludes  the  chapter. 


3.2  Adaptive  Multiresolution  Coding 


In  this  section,  we  shall  describe  the  proposed  method  for  MR  adaptive  image  encoding.  We 
shall  start  by  assuming  we  have  a  set  of  sample  images  x1,x2,...,xm  assumed  to  be  drawn  iicl 
from  a  common  distribution  over  EA'  (here  N  is  their  common  size).  We  shall  decompose  these 
images  by  a  fixed  MR  transform  and  then  for  each  of  the  resulting  subband  spaces  we  shall  learn 
an  ICA  basis,  by  using  the  subband  coefficient  sets  of  all  the  images  in  the  sample  as  training 
data.  Finally,  we  shall  use  the  subband  ICA  coefficients  to  design  a  quantizer.  Every  new  image 
will  be  transformed  and  quantized,  and  finally  output  into  a  bitstream  via  an  arithmetic  coder.  In 
the  following,  we  provide  details  on  each  of  the  modules  of  our  system. 


Multiresolution  Transforms.  The  first  step  of  our  hybrid  method  aims  at  separating  image  con¬ 
tent  by  projecting  images  on  scale-orientation  subbands.  The  most  widely  used  MR  transforms 
are  based  on  wavelets  and  this  due  their  theoretical  and  computational  properties.  For  instance, 
JPEG2000  (Part  1)  uses  the  Cohen-Daubechies-Feauveau  9/7  biorthogonal  wavelet  filters  [|34]|,  as 
its  only  supported  “irreversible”  wavelet  transform  [|1 09|] .  We  also  chose  to  employ  this  wavelet 
because  we  wanted  to  test  the  coding  efficiency  of  our  method  against  that  of  the  most  common 
fixed  MR  transform.  To  share  further  similarities  with  existing  image  coders,  we  applied  this 
separable  decomposition  method,  using  whole-point  symmetric  edge  handling.  The  implemen¬ 
tation  we  used  in  our  experiments  was  that  of  Matlab  Wavelet  Toolbox  4.2.  (For  the  sake  of 
completeness,  let  us  mention  that  the  CDF  9/7  wavelets  are  referred  there  as  '  bior  4 . 4  ' .) 

Formally,  we  consider  that  the  wavelet  transform  is  represented  by  a  NxN  invertible  matrix 
M,  partitioned  (row-wise)  into  NkxN  matrices  M/.,  with  1  <k<K.  In  our  case,  the  submatri¬ 
ces  correspond  to  different  scale/orientation  subbands.  For  each  subband  k,  and  image  x?,  let 
x  -  =  M/,:Xj  be  the  coefficients  of  the  image  over  the  subband.  The  goal  MrICA  is  to  derive 
for  each  k  the  NkxNk  matrices  Wfc  such  that  s-  =  Wfcx^  are  realizations  of  a  random  vector 
with  maximally  independent  components.  In  other  words,  within  each  subspace  defined  by  the 
partition  (M/,:)A.  we  search  for  the  linear  mapping  allowing  us  to  best  approximate  the  projected 
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data  distribution  by  a  product  of  marginals. 

Adaptation.  In  unsupervised  learning,  the  problem  of  separating  signals  into  independent  linear 
components  can  be  formulated  as  follows:  given  a  set  of  ^-dimensional  vectors  (y<k,)\<k<K, 
search  for  a  linear  transform  A  such  that  the  observed  vectors  are  linear  mixtures  (induced  by 
A)  of  realizations  of  an  M-dimensional  random  vector  Z  =  (zi, ... ,  zm)t  whose  components 
are  as  independent  as  possible.  We  can  express  this  model  compactly  as: 

Y  =  AZ  (3.1) 

where  Y  G  WNxK,  A  G  MJVxM,  and  Z  G  M.MxK .  The  ICA  computational  objective  is  then 
to  find  the  linear  transform  A,  such  that  the  mutual  information  among  the  coefficients  z%  is 
minimized.  To  simplify  the  description,  we  will  assume  that  A  is  square  and  invertible  (that  is, 
M  =  N)  and  if  we  denote  its  inverse  by  W,  the  problem  is  reduced  to  minimizing: 

M 

I(zi,  ...,zn)  =  Y^  H{Zj)  -  H(Y)  -  log  I  det  W|  (3.2) 

3= 1 

Since  the  entropy  of  the  observed  mixture  Y  is  constant,  imposing  |  det  W|  =  1  causes  the 
quantity  we  seek  to  minimize  to  be  the  sum  of  the  coefficients’  marginal  entropies;  that  is,  ICA 
searches  for  the  transformation  giving  the  (potentially)  most  compact  linear  code  of  the  data. 

When  the  size  of  the  training  images  is  very  high  let  us  observe  that  the  size  of  the  subbands 
at  the  first  decomposition  level  (roughly  one  quarter  of  that  of  the  original  image)  is  still  very 
large  and  we  cannot  learn  a  complete  basis  for  the  subband.  However,  by  applying  a  variant  of 
our  method  called  modified  MrICA  we  can  overcome  this  obstacle.  Namely,  we  impose  that  for 
all  subbands  up  to  some  decomposition  level  L'  we  learn  ICA  bases  in  a  block-based  fashion, 
while  for  the  coarsest  subbands  we  perform  MrICA  as  usual.  As  we  will  later  point  out,  the 
computational  savings  will  be  tremendous  since  we  only  need  to  solve  a  linear  number  of  conve¬ 
niently  small  ICA  problems.  Moreover,  since  most  of  the  wavelet  coefficients  are  already  very 
small,  there  will  be  virtually  no  blocking  artifact  in  the  reconstruction.  (Let  us  point  out  that 
this  is  not  the  same  as  decomposing  the  image  into  moderately  large  blocks  ( e.g .,  64  x  64)  and 
applying  MrICA  to  the  blocks.) 

Quantization  and  Coding.  Next,  we  shall  describe  the  subband  coding  procedure  employed 
to  transform  the  coefficients  into  bitstreams,  for  both  the  wavelets  and  MrICA.  For  a  group  of 
images  from  the  training  set,  we  group  the  MR  coefficients  belonging  to  the  same  subband  and 
from  the  whole  group,  we  estimate  a  scalar  quantizer.  Note  that  scalar  quantization  is  justified 
in  the  case  of  MrICA,  since  coefficients  within  each  subband  are  as  independent  as  possible.  To 
design  the  subband  quantizers,  individual  bit  rates  are  allocated  according  to  the  relative  energy 
within  each  subband.  Since  we  are  interested  in  the  potential  improvement  of  the  adaptive  rep¬ 
resentation,  and  less  so  in  the  great  many  practical  issues  of  image  coding,  we  will  compute  the 
“optimal”  entropy-constrained  scalar  quantization  [|52|]  for  each  subband.  This  should  provide 
a  reliable  upper  bound  for  the  performance  of  each  representation.  After  quantizing  the  coef¬ 
ficients,  we  use  Matlab  Communication  Toolbox’s  arithmetic  coder  to  construct  the  bitstreams 
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and  record  the  total  bitstream  length  and  the  reconstruction  SNR  for  each  test  image.  Then,  we 
take  the  average  over  the  whole  test  set  to  estimate  the  coding  efficiency  of  the  distribution.  We 
repeat  this  procedure  for  various  target  rates,  and  by  interpolation  we  construct  the  rate-distortion 
curve. 


3.3  Complexity  Issues  of  MrICA 

An  important  aspect  that  motivates  our  hybrid  method  is  the  lack  of  ability  to  compute  an  ICA 
basis  for  large-dimensional  data.  As  we  have  mentioned  in  the  background  chapter  (see  sec¬ 
tion  m  the  computational  cost  becomes  prohibitive  because  of  the  large  number  of  parameters 
we  need  to  estimate.  In  the  following  we  shall  explain  how  this  problem  is  handled  by  MrICA. 

For  simplicity,  let  us  assume  that  we  employ  a  wavelet  basis  for  the  MR  step;  for  one  de¬ 
composition  level,  this  will  produce  three  detailed  subbands  (horizontal,  vertical,  and  diagonal) 
and  one  approximation  subband  whose  dimension  will  be  1/4  of  the  original  image  size.  Let  us 
denote  by  T  (n)  the  computational  cost  of  performing  one  ICA  iteration  when  the  size  of  the  data 
is  nxn;  suppose  the  original  image  size  is  dxd  and  say  we  use  the  wavelet  decomposition  with 
L  resolution  levels.  Then,  the  total  cost  of  one  ICA  step  across  subbands  is: 


r(^)  +  r(4)  +  ...  +  r(4) 


rr,,  d  . 
+  T<2E> 


(3.3) 


The  iteration  cost  of  a  typical  off-line  ICA  algorithm  which  involves  a  matrix  multiplication,  (or 
an  inversion,  or  a  re-orthogonalization)  is  of  the  order  n21og7.  Thus  a  rough  estimate  of  the  ratio 
between  the  cost  of  a  MrICA  iteration  (again,  for  all  subbands)  and  T(d)  is 
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where  a  =  2_21og7  ps  0.0204.  Thus,  the  computational  savings  are  significant,  the  entire  process 
getting  to  be  roughly  16  times  faster.  If  we  also  take  into  account  the  fact  that  for  a  smaller 
problem  we  generally  need  fewer  iterations  to  converge,  we  realize  that  even  higher  savings  are 
possible.  Let  us  point  out  that  in  the  case  of  the  modified  MrICA,  which  treats  the  most  detailed 
subbands  up  to  level  L'  <  L  in  a  block-based  fashion,  the  approximate  cost  ratio  computed  like 
above  becomes  (3 L'  +  4 )aL').  For  a  very  large  image,  and  even  a  moderate  limit  level  L',  the 
cost  is  reduced  tremendously. 

We  should  emphasize  that  for  each  image  class  that  we  are  interested  to  represent,  we  need  to 
pay  the  computational  price  of  learning  only  once.  That  is,  having  learned  the  MrICA  basis  for 
our  class  we  will  be  able  to  use  it  whenever  is  necessary.  This  is  a  common  practice  in  (off-line) 
machine  learning  and  thus  applies  to  our  adaptive  signal  coding  setting. 
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(c)  Detailed  Subband  H\  (d)  Detailed  Subband  l)  \ 


Figure  3.2:  Basis  functions  computed  by  MrICA  with  1  MR  decomposition  level  for  32x32  log- 
scale  natural  images  (see  text).  For  each  subband,  a  random  set  of  basis  functions  are  displayed. 


3.4  Experimental  Results 


We  shall  illustrate  the  encoding  performance  of  MrICA  by  plotting  the  average  rate-distortion 
curves  generated  by  coefficient  quantization  at  various  levels  of  precision,  for  both  the  fixed  and 
the  adaptive  MR  transforms.  First  of  all,  however,  let  us  comment  on  the  features  learned  by  our 
method  when  applied  to  natural  images. 

We  applied  MrICA  to  encoding  natural  scenes  randomly  cropped  from  van  Hateren’s  database 
of  natural  stimuli  [|1 16||.  We  tested  our  method  on  images  of  two  sizes,  32  x  32  and  respectively, 
64  x  64  pixelsQ  Instead  of  working  with  the  pixel  intensities,  we  took  the  logarithm  of  these 
intensities  before  any  further  processing;  as  explained  in  Ql  16Q,  the  reasons  for  this  operation  are 
to  incorporate  contrast  invariance  of  natural  scenes,  get  better  first-order  statistics  of  the  natural 
image  data,  and  better  mimic  the  operations  performed  by  the  first  stages  of  visual  systems.  On 
each  of  these  logarithmically  transformed  images,  we  applied  the  discrete  wavelet  decomposition 
and  learned  the  subband  ICA  matrices,  as  described  in  the  previous  section. 

Interpreting  the  ICA  objective  as  maximizing  the  (log-)likelihood  of  the  data  under  the  linear 
model,  or  as  minimizing  the  Kullback-Leibler  divergence  between  the  joint  probability  and  the 
product  of  marginals  has  produced  several  families  of  ICA  algorithms  (see  section  |2l2|).  For  the 
results  reported  in  this  section  we  employ  the  Relative  Trust-Region  algorithm  [|29|].  Figure  [T2| 


'As  they  are  similar  to  JPEG2000  standard  code  blocks  sizes,  we  considered  these  image  sizes  relevant  to  use 
for  comparison  purposes. 
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displays  a  random  set  of  such  ICA  basis  functions  learned  from  the  32  x  32  data  set  with  one 
MR  decomposition  level.  MrICA  basis  functions  of  the  approximation  subband  retain  the  aspect 
of  classic  image  ICA  basis  functions  (relatively  low  spatial  frequency,  all  orientations)  [JT4|,  |1  16||, 
which  is  not  a  surprise  considering  that  the  approximation  subband  contains  a  low-resolution 
version  of  the  original  image.  On  the  other  hand,  the  detailed  subbands  basis  functions  look  like 
localized  features,  preserving  the  dominant  orientation  of  the  subband.  Besides  the  quantitative 
(mutual  information)  difference  between  the  MrICA  detailed  bases  and  corresponding  wavelets, 
we  note  that  the  adaptive  features  are  also  more  diverse  in  shape  (i.e.,  they  are  not  shifted  copy 
of  a  single  wavelet  kernel). 


Next,  we  illustrate  the  improvement  in  coding  efficiency  afforded  by  adaptiveness.  In  ad¬ 
dition  to  comparing  the  adaptive  and  non-adaptive  MR  methods  described  in  the  previous  sec¬ 
tion,  we  also  compare  both  of  them  against  JPEG2000;  for  this  purpose,  we  used  Jasper  [|83|],  a 
software  package  implementing  JPEG2000.  The  coding  cost  of  the  adaptive  and  non-adaptive 
representations  does  not  include  the  basis  functions  (respectively,  the  wavelets),  as  these  can 
rightfully  be  considered  part  of  the  coder,  and  not  of  the  code;  also,  in  case  of  Jasper,  we  report 
only  the  codestream  length  (i.e.,  not  including  the  metadata).  In  the  case  of  MrICA,  the  images 
included  for  the  evaluation  are  taken  from  the  testing  set,  that  is,  they  belong  to  the  same  signal 
class  as  those  in  the  training  set  but  have  not  been  used  during  learning.  Let  us  point  out  that  the 
JPEG2000  performs  quantization  for  each  individual  image,  and  not  over  a  whole  sample  set, 
unlike  our  method.  In  this  respect,  our  quantizers  take  advantage  of  more  information.  On  the 
other  hand,  JPEG2000  performs  surprisingly  well  considering  that  we  used  an  optimal  ECSQ, 
and  not  a  uniform  one.  The  rate-distortion  trade-off  obtained  for  the  32  x  32  test  images,  with 
one  MR  level,  for  the  three  encoding  methods,  are  presented  in  Figure  |3.4|.  The  top  plot  shows 
the  relative  coding  gain,  while  the  bottom  plot  shows  the  relative  bit-rate  difference  of  the  three 
methods,  taking  the  non-adaptive  wavelet  representation  as  reference.  The  better  coding  effi¬ 
ciency  of  MrICA  (more  apparent  at  low  bit  rates)  has  two  important  consequences:  the  same 
distortion  (or  SNR)  can  be  achieved  by  the  adaptive  method  for  a  significantly  lower  rate  (i.e., 
bit  cost)  and,  reciprocally,  for  a  fixed  bit  rate  we  can  get  a  significantly  better  improvement  in 
fidelity. 


As  it  is  well  known,  SNR  is  not  a  relevant  measure  of  perceptual  distortion;  instead,  eval¬ 
uating  the  representational  power  of  MrICA  should  also  involve  assessing  the  presence  of  re¬ 
construction  artifacts.  For  this  purpose,  we  chose  to  display  several  test  images  from  the  two 
datasets,  their  encoded  version  via  the  adaptive  and  non-adaptive  MR  transforms,  and  the  resid¬ 
ual  errors.  Figure  [33]  illustrates  the  encoding  results  of  six  examples  from  the  32  x  32  dataset. 
Each  image  has  been  encoded  to  a  quality  of  25dB;  the  coding  gains  of  MrICA  over  the  non- 
adaptive  wavelet  method  for  these  images  are:  2.43  bpp,  0.62  bpp,  2.91  bpp,  2.78  bpp,  3.39 
bpp,  and  2.69  bpp.  Figure  [kf]  shows  six  examples  of  64  x  64  images  encoded  at  20dB.  The 
coding  gain  values  of  the  adaptive  method  are  in  this  case  1.5  bpp,  1.48  bpp,  0.33  bpp,  0.23  bpp, 
0.45  bpp,  and  1.23  bpp.  (For  both  figures,  the  colormaps  are  maximally  stretched  to  enhance 
visibility.)  As  a  general  conclusion,  MrICA  obtains  a  better  coding  rate  than  the  fixed  wavelet 
representation,  with  fewer  reconstruction  artifacts. 
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3.5  Concluding  Remarks 


We  proposed  MrICA,  a  hybrid  method  that  combines  the  advantages  of  both  adaptive  and  mul¬ 
tiresolution  representations.  We  illustrated  the  significant  coding  efficiency  gain  of  MrICA  over 
the  wavelet  transform  when  applied  to  natural  images,  which  is  explained  by  the  ability  of  the 
new  method  to  adaptively  describe  image  structure  at  all  scales.  This  suggests  that  an  image 
coder  devoted  to  a  given  class  of  signals  should  use  not  only  multiresolution,  but  also  adaptivity 
to  optimize  encoding  performance. 

One  particularly  important  issue  that  remains  unsolved  is  the  optimality  guarantee:  it  would 
be  very  useful  to  find  an  explicit  relationship  between  the  lower  bound  of  the  “classic”  ICA 
objective  function,  and  the  one  achievable  by  MrICA,  when  applied  to  the  same  data.  Since 
our  method  reduces  only  intra-band  redundancies,  we  expect  that  in  general  this  difference  will 
not  be  negligible.  This  is  definitely  a  very  interesting  direction  that  we  intend  to  pursue  in  the 
future  work,  as  it  will  help  us  measure  the  trade-off  between  coding  efficiency  and  computational 
complexity  more  accurately. 

Finally,  let  us  mention  that  applying  the  method  described  here  is  by  no  means  restricted  to 
the  class  of  natural  images.  Indeed,  MrICA  provides  a  general  framework  for  data-dependent 
signal  coding  which  could  potentially  be  useful  in  representing  more  restricted  image  classes, 
such  as  faces[j  and  medical  images,  and  even  to  other  modalities  (e.g.,  speech  and  video). 


2One  particularly  interesting  application  we  recently  heard  of  is  the  Automatic  Cameraman  project  at  UCSD 

[Hi 


22 


o  o 


71/2 


(a)  32  x  32  blocks  without  MR  decomposition. 
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(b)  32  x  32  blocks,  1  MR  decomposition  level. 
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(c)  32  x  32  blocks,  2  MR  decomposition  levels. 


Figure  3.3:  Layout  of  the  parameters  corresponding  to  MrICA  basis  functions,  in  the  Spatial 
frequency  (radial)  vs.  Orientation  (angular)  domain.  The  black  circles  represent  the  parame¬ 
ters  of  MrICA  basis  functions  computed  for  the  approximation  subband.  Colored  circles  rep¬ 
resent  basis  functions  from  the  intermediate  detailed  subbands  (red=horizontal,  green=vertical, 
blue=diagonal).  Colored  dots  represent  basis  functions  from  the  highest  resolution  detailed  sub- 
bands. 
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Figure  3.4:  Relative  rate-distortion  performance  of  three  methods  (MrICA,  wavelet,  JASPER) 
computed  for  the  32  x  32  test  images. 
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Figure  3.5:  Examples  of  32  x  32  images  reconstructed  at  25dB.  Column  1. Original  images; 
2. Image  encoded  by  non-adaptive  method;  3. Error;  4.1mage  encoded  by  MrICA  ;  5. Error 
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Figure  3.6:  Examples  of  64  x  64  images  reconstructed  at  20dB.  Column  1. Original  images; 
2. Image  encoded  by  non-adaptive  method;  3. Error;  4.1mage  encoded  by  MrICA  ;  5. Error 
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Chapter  4 
Point  Coding 


4.1  Introduction 

Efficient  image  representation  is  an  important  research  problem,  due  to  its  practical  applications 
Hi  and  to  its  potential  as  a  principled  approach  to  modeling  natural  vision  p2|j.  Capturing 
the  structure  of  visual  signals  and  encoding  it  compactly  is  a  challenging  task.  For  example, 
relevant  visual  content  can  appear  at  any  spatial  position  and  scale,  which  explains  the  success 
of  image  coders  based  on  multiscale  representations  such  as  wavelets  However,  even 

wavelets  prove  suboptimal  in  modeling  certain  structure  frequently  occurring  in  images  (among 
other  things,  sharp  edges  at  arbitrary  orientations)  and  recently  several  new  representations  have 
been  designed  to  fill  this  gap  (see  for  instance  mmm)- 

In  spite  of  spectacular  progress,  it  still  is  not  clear  what  constitutes  an  optimally  efficient 
code  for  images  in  general:  smooth  surfaces  and  short  straight  edges  may  be  the  optimal  features 
for  some  image  classes  ( e.g .,  natural  scenes),  but  not  for  others  (faces,  cartoons,  fingerprints, 
various  types  of  texture,  or  medical  images).  Furthermore,  according  to  Shannon’s  source  coding 
theorem,  a  representation  is  inherently  suboptimal  for  a  given  class  of  signals  of  interest,  unless 
it  captures  the  probability  density  of  the  data.  This  suggests  that  better  representations  can  be 
obtained  by  learning  more  general  and  flexible  dictionaries  that  reflect  the  statistical  structure  of 
particular  image  classes.  In  this  chapter,  we  focus  on  adaptively  deriving  dictionaries  defined  by  a 
set  of  relatively  small  image  patterns  (henceforth  called  “kernels”),  shifted  at  arbitrary  positions. 
The  goal  is  to  find  such  a  set  of  kernels  for  which  any  signal  in  the  class  of  interest  has  a  sparse 
linear  representation.  Each  coefficient  in  the  sparse  set  represents  a  triple:  one  component  is  its 
scaling  value  (including  sign),  the  second  is  the  index  A:  of  a  kernel,  and  the  third  one  is  a  point 
p  in  signal  space  -  the  location  of  the  shifted  kernel  k.  Therefore,  from  now  on  we  refer  to  the 
problem  above  as  Point  Coding.  The  usual  approach  to  computing  a  solution  is  to  minimize  a 
two-term  cost  function:  the  first  term  measures  the  fidelity  of  reconstruction  (usually,  an  error 
term),  while  the  second  term  stands  for  some  form  of  regularization  (in  our  case,  sparsity).  By 
employing  different  choices  for  these  two  terms  various  algorithms  can  emerge,  each  with  its 
own  advantages  and  technical  challenges. 

This  area  of  research  has  been  particularly  active  in  recent  years,  and  several  interesting  di- 
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rections  have  been  explored  [0,  H,  |67|.  01.  However,  this  setting  is  by  no  means  particular 
to  images:  approaches  to  sparse  adaptive  representation  of  general  types  of  signals  exist  which 
focus  on  sounds  [[3].  ^4j.  |I08|J,  or  combined  audio-visual  signals  [|88]1.  Our  goal  is  to  bring 
forward  computationally  efficient  methods  for  designing  very  general,  (approximately)  shift- 
invariant  adaptive  image  representations.  The  generality  refers  to  the  fact  that  the  kernel  sizes 
can  be  distinct  and  arbitrary ;  the  user  gives  up  the  control  over  this  issue  to  the  kernel-learning 
procedure  and  the  adaptive  process  may  even  produce  a  multiscale  representation  if  optimality 
demands  it  (in  a  similar  fashion  to  Spike  Coding  ||1 06[  |107|]).  Computational  efficiency  stems 
from  exploiting  the  structure  of  the  optimization  problem,  using  tools  imported  from  structured 
matrix  algebra  [|54],  [70|.  |94]].  Namely,  the  coefficient  extraction  step  uses  a  fast  implementation  of 
Matching  Pursuit  with  essentially  logarithmic  cost  per  iteration,  while  dictionary  update  is  per¬ 
formed  by  solving  a  highly  structured  least-squares  problem,  either  by  algebraic  characterization 
of  pseudoinverses  of  certain  structured  matrices  []6Q|l ,  or  by  fast  interpolation  methods  [[76],  |1 14fl. 

The  chapter  is  organized  as  follows.  Section  [O] contains  the  mathematical  formulation  of  the 
problem.  Next  we  describe  the  alternative  steps  of  the  numerical  solution:  section  [O]  reviews 
the  sparse  coefficient  extraction  and  section  [T4|  is  concerned  with  the  dictionary  update.  We 


include  the  experimental  results  in  section  [43]  and  present  our  conclusions  in  section 


4.2  The  Point  Coding  Problem 

In  this  section  we  shall  formulate  the  Point  Coding  problem  mathematically.  Let  us  start  by 
introducing  the  appropriate  notation. 

Let  $  =  {(j) i, ...,  (j)K}  be  a  set  of  two-dimensional  (rectangular)  kernels,  of  possibly  different 
sizes  mkxnk,  normalized  to  unit  Frobenius  norm,  and  let  f  =  (vec((ft1)T, . . . ,  vec (cf)K)T)T  the 

K 

ensemble  obtained  by  concatenating  their  vectorized  versions]];  thus,  if  D  =  rnknk,  then 

k=  1 

f  e  Md.  For  each  kernel  </>&€$  and  for  p  6  IN 2 ,  we  denote  <pk)P  the  translated  version  of  the 
kernel  such  that  its  upper-left  corner  lies  at  position  p.  (Here,  we  shall  work  exclusively  with 
finite-size  images,  which  means  that  if  (j)k  is  entirely  contained  within  an  MxN  image,  it  can 
only  be  shifted  into  (M  —  mk  +  1)  x  (N  —  nk  +  1)  positions.) 

For  all  k  and  p,  the  coefficient  of  the  shifted  kernel  (j)k:P  will  be  denoted  skj).  Under  the  linear 
additive  noisy  model  assumption,  for  any  image  x  we  can  write: 

K 

x  =  ^  ^  sfciP  •  +  e  =  x(s,  $)  +  e  (4.1) 

k= 1  p&Pk 


where  Pk  is  the  set  of  all  occurrences  of  kernel  cj)k  in  the  representation^.  As  a  measure  of 

'For  a  matrix  M,  vec (M)  is  the  set  of  all  the  entries  in  the  matrix,  stored  column-wise. 

2For  brevity,  we  will  further  refer  both  to  <1>  and  to  f  as  the  encoding  dictionary.  Also,  we  will  refer  to  the 
coefficients  sk,P  as  points.  Note  again  that  one  point  is  determined  not  only  by  the  value  of  the  coefficient,  but  also 
by  its  corresponding  kernel  and  by  the  position  where  it  occurs. 
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representation  accuracy,  we  hereby  consider  the  (squared)  reconstruction  error: 

K  <4 

Fx($,s)  =  ||x-x(s,$)||!  =  (4.2) 

k= 1  p=  1 

where  for  all  k,  dk  =  |  Pk  is  the  number  of  shifted  versions  of  kernel  (j)k.  Optimizing  for  sparsity 
of  the  representation  translates  into  minimizing  the  number  of  non- zeros  in  s.  Therefore,  for  a 
fixed  signal  x,  we  should  solve  the  following  optimization  problem 

min  1 1  s  1 1  o 

$,s 

s.t.  Fx  (<f>,  s)  <  e 


for  e  >  0  or  equivalently: 


mm 

€>,  s 


Fx($,s)  +  A||s| 


(4.3) 


for  some  A  >  0. 

The  optimization  problems  above  are  NP-hard  (see  for  instance  []37|,  [38],  [3^,  |1 1 1|]).  Therefore, 
we  attempt  to  approximately  find  a  solution  via  an  iterative,  alternating  procedure.  First,  we  find 
a  sparse  set  of  points  corresponding  to  a  fixed  dictionary  and  a  preset  level  of  precision;  then,  for 
a  fixed  set  of  coefficients,  update  the  dictionary  to  better  fit  the  data.  Finding  the  sparsest  linear 


approximation  in  a  general  dictionary  is  also  NP-hard  09QQ;  however,  suboptimal  approaches  (like 
greedy)  proved  quite  satisfactory  in  practice.  Therefore,  for  the  first  step  we  choose  to  employ 
Matching  Pursuit  [|87|].  The  second  step,  adapting  the  dictionary  to  the  signal  structure  only 
requires  solving  a  quadratic  (i.e.,  convex)  optimization  problem  in  <I>  (or  f).  In  the  following,  we 
shall  describe  each  of  the  two  steps  separately. 


4.3  Matching  Pursuit 

The  Matching  Pursuit  (MP)  algorithm  [|87|]  is  a  greedy  iterative  procedure  whose  goal  is  to  iden¬ 
tify  a  decomposition  of  a  given  vector  as  a  linear  combination  of  elements  of  a  dictionary.  If  the 
dictionary  is  an  orthogonal  vector  set  and  the  signal  is  indeed  a  sparse  combination  of  atoms,  MP 
is  guaranteed  to  find  this  sparse  set.  In  general,  this  method  only  serves  as  an  approximation  to 
the  sparsest  set  problem  (see  for  instance  []90|l).  For  a  detailed  presentation  of  how  MP  can  be 
used  to  compute  sparse  signal  representations  in  a  shiftable-kernel  dictionary  see  section  0 
The  main  practical  challenge  in  using  Matching  Pursuit  with  a  high-dimensional,  highly  over¬ 
complete  dictionary  is  the  large  cost  of  the  update  and  of  identifying  the  next  atom,  maximally 
correlated  with  the  residual.  In  the  case  of  a  1-D  shiftable-kernel  dictionary  with  small  ker¬ 


nels,  approaches  presented  in  [|74|,  [75lj  and  (|105|1  shrink  this  cost  to  be  ()(K L),  i.e.,  essentially 
logarithmic  in  the  size  of  the  signal  (we  assume  that  the  number  of  kernels  and  their  sizes  are 
small  constants  compared  to  the  signal  size).  The  difference  between  the  two  approaches  is 
that  in  MPTK  [f74|,  }75|]  the  logarithmic  cost  is  obtained  by  searching  for  the  maximum  coeffi¬ 
cient  within  a  balanced  binary  tree,  while  in  Sallee’s  work  [|105[]  it  is  achieved  by  maintaining  a 
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Top  of  heap  holds  coordinates  of 


K  arrays. 

(correlation 

coeffs) 


Figure  4.1:  A  graphical  depiction  of  the  data  organization  for  the  efficient  Matching  Pursuit 
algorithm. 


heap  [|32|]  which  holds  the  maximum  correlation  coefficients  of  the  kernels  with  the  signal  over 
equal-length  adjacent  blocks.  Thus,  every  pair  of  clelete-max  and  insert  heap  operations,  helps 
achieving  the  desired  cost. 

We  employ  the  second  approach,  slightly  adapting  it  to  the  particularities  of  the  2D  case. 
Namely,  we  compute  the  correlation  coefficients  of  the  image  with  all  the  kernels  in  our  dictio¬ 
nary  and  then  divide  these  correlation  maps  into  blocks  of  size  m0xn0,  where  m0  =  max;,,  rrik 
and  n0  =  ma,xknk.  At  each  step,  we  shall  only  update  a  small  number  of  blocks  (at  most  4) 
and  therefore  we  only  need  to  search  for  a  small  number  of  new  maxima.  (Figure  M  contains 
a  graphical  description  of  this  data  structure.)  The  heap  structure  insures  that  we  avoid  most  of 
the  work  required  by  searching  the  new  maxima;  also,  careful  storage  of  the  correlation  matrices 
further  can  help  by  enhancing  data  locality  and  thus  avoid  costly  memory  operations.  As  a  typi¬ 
cal  result,  decomposing  a  256  x  256  image  to  30dB  reconstruction  quality  using  a  dictionary  of 
25  Gabor-looking  8x8  kernels  can  be  executed  in  as  little  as  6.4  seconds  on  a  G5  Mac  computer 
(with  a  MEX  C  implementation  of  Matching  Pursuit). 

Once  MP  computes  a  sparse  set  of  coefficients  (now  considered  fixed),  we  proceed  to  opti¬ 
mizing  the  kernels  to  better  fit  the  signals.  Let  us  note  that  it  is  not  trivial  to  adapt  the  “orthogo¬ 
nal”  variants  of  MP  (see  [231)  to  the  setting  described  above.  Such  methods  attempt  to  remove  the 
intrinsically  suboptimal  choice  of  MP  by  modifying  the  criterion  (maximally  correlated  atom) 
as  well  as  adding  other  (also  greedy)  heuristics  as  a  subsequent  refinement  step.  We  are  still 
investigating  the  possibility  of  adapting  these  ideas  to  the  shiftable-kernel  MP  setting. 
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4.4  Dictionary  Update 


In  the  following  we  describe  the  computation  of  the  optimal  dictionary  for  a  given  set  of  points 
in  terms  of  searching  for  the  mode  of  the  posterior  distribution,  in  a  similar  fashion  to  []93|,  |107|] : 

p(x|<3>)  =  j  p(x|<3>,  s)p(s)ds  (4.4) 


where  for  integration  we  marginalize  over  all  possible  point  sets.  We  can  approximate  the  integral 
above  with  p(x|3>,  s')p(s'),  where  s'  is  the  set  of  coefficients  produced  by  Matching  Pursuit. 
Then,  assuming  an  additive  Gaussian  noise  model  e~A/"(0,  creI),  we  can  compute  the  gradient 
for  every  kernel  </>&  as  follows: 

r\  r\ 

—  Iog0(x|$))  =  —  {log{p(x\®,s'))  +  log(p(s'))}  (4.5) 

—  Id  K  Hk' 

=  (4-6) 

e  k'= i  p= i 

nk 

=  —  J2Sk’P'  [x-x(s,$)]fciP  (4.7) 

p=i 

where  [x  —  x(s,  #)]fe  denotes  the  restriction  of  the  error  image  to  the  support  of  4>k,p-  This  im¬ 
mediately  gives  us  a  learning  rule  for  the  MAP  dictionary  and  we  could  employ  any  (stochastic) 
gradient  based  method  to  perform  the  optimization. 

Let  us  observe  that  maximizing  the  posterior  with  respect  to  $  means  minimizing  the  (squared) 
reconstruction  error,  which  is  simply  a  quadratic  form  of  the  ensemble  f.  By  a  slight  abuse  of 
notation,  we  shall  designate  x,:  as  being  both  the  training  image  x,  itself  and  its  associate  vector 
in  MLi  (L,  is  the  size  of  image  x,).  Now  take  to  be  S W  =  . . . ,  S'OTO]  G  MLiXD  the  matrix 

corresponding  to  the  linear  mapping 

xi(sW,^)  =  5'W-f  (4.8) 


and  so  the  optimization  problem  we  need  to  solve  reduces  to  minimizing  the  following  cost 
function: 


Q(  f) 


SWf|l2 


Di> 

i=l 

I 

Y  (||xi||2-2xfS'wf  +  iTS^TS^i) 


i=l 


ct.  +  Y  (-2xfS(i)f  +  fTSWTSwf) 


i= 1 


ct.  -  2  ^Y*js{i)  )  f  +  fT  (  Y  S(l)TS(l)  )  f 

c  -  2brf  +  frAf 


,  2—1 


(4.9) 

(4.10) 

(4.11) 

(4.12) 

(4.13) 
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This  quadratic  form  has  a  special  structure:  since  matrix  S  W  is  a  block-row  whose  blocks  are 
each  Toeplitz-Block-Toeplitz  (TBT)  matrices^,  it  follows  that  matrices  (S'WT(S'W  win  be  mosaic 
TBT  matrices  of  identical  block  sizes.  Consequently,  matrix  A  will  be  a  symmetric,  positive 
semidehnite,  DxD  mosaic  TBT  matrix  and  therefore  we  can  reduce  the  original  problem  to  a 
structured  least-squares  problem. 

Structured  Least  Squares.  The  advantage  of  working  with  structured  matrices  comes 
mainly  from  the  fact  that  the  number  of  parameters  is  much  smaller  than  the  actual  dimension  of 
the  matrix  and  from  the  existence  of  fast  and  superfast  algorithms  that  exploit  the  displacement 
rank  of  many  such  matrices  0,01. 

An  algebraic  characterization  of  pseudoinverses  of  Toeplitz  and  Hankel  mosaic  matrices  has 
been  presented  in  [fjOj],  which  generalizes  the  well-known  Gohberg-Semencul  inversion  formula 
for  Toeplitz  matrices,  by  using  a  general  notion  of  Bezoutian.  The  effect  is  that  fast  and  superfast 
algorithms  can  then  be  employed  to  compute  the  pseudoinverses,  and  consequently  to  solve  the 
structured  least-square  problems. 

Definition  1.  A  ( q,p)-mosaic  matrix  B  is  said  to  be  a  generalized  Toeplitz  (q,p,  r) -Bezoutian  if 
its  generating  function  admits  the  representation 

B{\p)  =  y^-U(\)V(p)t.  (4.14) 

where  U (A)  is  a  q  x  (p  +  q  +  r )  and  V (p)  is  ap  x  (p  +  q  +  r)  matrix  polynomial. 

Then,  the  following  theorem  provides  a  characterization  of  the  pseudoinverse  of  a  Toeplitz  - 
mosaic  matrix,  which  means  that  actually  computing  the  pseudoinverse  can  be  performed  effi¬ 
ciently  via  several  convolutions  (in  our  case  p  —  q  —  K,  the  number  of  kernels). 

Theorem  4.4.1.  iff]/  The  Moore-Penrose  inverse  of  a  (p,  q)  —  Toeplitz  mosaic  matrix  is  a  Toeplitz 
( q,p ,  q  +  p)—  Bezoutian. 

By  adapting  the  result  above  to  the  case  of  mosaic  TBT  matrices,  we  obtain  an  analog  char¬ 
acterization  of  matrix  A. 

A  different,  but  equally  attractive  solution  to  the  structured  problem  above  is  suggested  by 
the  approaches  in  [|1 13|,  |1  14|[.  Namely,  they  translate  the  Toeplitz  least  squares  problem  into  an 
interpolation  problem,  which  can  then  be  solved  by  using  a  superfast  method.  Finding  generators 
for  the  displacement  of  our  particular  mosaic  TBT  matrix  is  rather  straightforward  by  using  the 
algorithm  in  the  appendix  of  [|69[];  this  helps  us  reduce  our  own  problem  to  a  similar  interpolation 
problem,  which  therefore  can  be  solved  efficiently  in  0(D  log2  D). 


4.5  Experimental  Results 

In  this  section,  we  shall  present  the  results  of  applying  the  above  method  to  fairly  different  cate¬ 
gories  of  images,  which  illustrates  the  importance  of  the  signal  class  on  the  adapted  dictionary. 

3We  find  it  useful  to  point  out  that  the  ID  correspondent  is  a  block-row  matrix  with  Toeplitz  blocks  (also  known 
as  Toeplitz- striped  matrix). 
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(a)  Natural  image. 
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Figure  4.2:  Results  of  applying  the  Point  Coding  method  to  natural  images  in  the  Kyoto  database. 
The  dictionary  was  initialized  with  K  =  25  random  kernels  of  size  10  x  10.  Kernels  are  up-scaled 
for  a  better  visualization;  actual  pixel  size  is  displayed  above  each  kernel  subplot. 


ng  to  representatives  of  under¬ 
ted  minority  groups.  “It’s  inex- 
ible,”  said  Eric  Rodriguez,  sen- 
►olicy  analyst  with  the  National 
icil  of  La  Raza,  a  Hispanic  advo- 
r  organization.  “I  don’t  under- 
d  it.  We  had  an  undercount  of  5 
ent  of  the  Latino  population  in 
.  If  this  ruling  is  upheld  we  have 
eason  to  expect  anything  but  a  5 
:ent  undercount  this  time.” 
scause  of  large  concentrations  of 
ks  in  inner  cities  and  increased 
ibers  of  immigrants,  mainly  in 
Southwest  —  two  groups  that 
e  notoriously  low  response  rates 
nailed  census  questionnaires  — 
1990  census  was  the  first  in  mod- 
history  to  miss  more  people  than 
census  that  preceded  it. 
oday’s  ruling  turned  on  an  inter- 
tation  of  two  seemingly  contra- 
:ory  provisions  in  the  Census  Acts 
.957  and  1976.  The  1957  act  said 
t  “except  for  the  determination  of 
ulation  for  apportionment  pur- 


tively  removed  the  proh: 
against  it  for  reapportionmen 
poses.  Further  they  argued  t 
1976  Congress  added  amendnn 
the  act  that  directed  the  Com 
Secretary  to  conduct  a  census 
10  years  “in  such  form  and  c 
as  he  may  determine,  includi 
use  of  sampling  procedures  ai 
cial  surveys.” 

In  today’s  ruling,  written  by 
Lamberth,  the  panel  disagree 
.the  Justice  Department’s  rea; 

“Though  defendants’  intei 
tion  of  the  except/shall  se 
structure  is  proper  in  sot 
stances,  the  court  finds  it 
strained  and  incorrect  when  i 
to  the  amended  section  19 
wrote. 

Judge  Lamberth  stressed  i 
portance  of  the  census  to  the  1 
of  power  in  Congress  betwe 
ties  and  between  states,  and 
was  hard  to  imagine  that  G 
would  change  the  rules  with 
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(a)  “Newspaper”  image. 


(b)  Newspaper  kernels  (55  iter.) 


Figure  4.3:  Results  of  applying  the  Point  Coding  method  to  newspaper  images.  The  dictionary 
was  initialized  with  K  =  40  random  kernels  (only  9  are  shown)  of  size  8x8.  Kernels  are 
up-scaled  for  a  better  visualization;  actual  pixel  size  is  displayed  above  each  kernel  subplot. 
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(a)  Fingerprint  image. 
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(b)  “Fingerprint”  kernels  (31  iter.) 


Figure  4.4:  Results  of  applying  the  Point  Coding  method  to  fingerprint  images.  The  dictionary 
was  initialized  with  K  =  25  random  kernels  of  size  10  x  10.  Kernels  are  up-scaled  for  a  better 
visualization;  actual  pixel  size  is  displayed  above  each  kernel  subplot. 


Natural  Images.  We  first  apply  the  method  presented  here  to  images  from  the  Kyoto  natural 
image  database  [|T5|] .  The  dictionary  started  out  as  a  set  of  25  random  10x10  patches,  and  evolved 
as  some  of  the  kernels  grew  or  shrank.  The  entire  set  of  kernels  is  displayed  in  Figure  [4.2(b)|; 
the  learned  kernels  display  the  expected  aspects  for  such  a  dataset  (namely  edges,  ridges,  cross 
patterns)  but  they  also  include  other  shapes  ( e.g .,  “round”  edges).  One  interesting  aspect  is  that 
the  kernels  in  the  final  dictionary  (here,  after  300  iterations)  do  not  seem  extremely  sensitive  to 
the  starting  point. 

Newspaper  Images.  A  different  type  of  signal,  with  significantly  distinct  statistical  structure, 
is  the  class  of  scanned  newspaper  images.  It  is  a  main  characteristic  of  this  class  that  fewer 
predominant  orientations  are  present.  Figure  |4.3(b)|  exhibits  kernels  adapted  to  this  class,  after 
55  iterations  and  starting  from  random.  As  can  be  observed  easily,  kernels  tend  to  capture  mainly 
printed  symbols;  if  a  large  enough  set  of  kernels  is  used  (e.g.,  40),  they  tend  to  stabilize  to 
individual  characters,  or  pairs  thereof. 

Fingerprint  Images.  Finally,  we  apply  our  method  to  another  highly  distinct  class,  namely 
fingerprint  images.  Figure  [4.4(b)|  displays  a  set  of  kernels  learned  from  images  in  the  Cross 
Match  Verifier  300  sample  fingerprint  database  0-  Although  initialized  randomly,  after  only 
31  iterations  the  25  kernels  already  localize  in  frequency  and  in  orientation,  although  not  also  in 
space  (easily  explained  by  the  structure  of  the  signals). 


The  results  presented  in  this  section  have  been  learned  from  a  training  set  ranging  from  10 
images  (newspaper)  to  50  images  (natural).  The  dictionary  size  was  hand-picked  to  the  reported 
values  in  order  to  avoid  redundancy  (e.g.,  several  kernels  being  copies  or  shifted  versions  of  each 
other).  We  currently  are  working  on  designing  an  automatic  mechanism  to  control  the  number 
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of  “sufficient”  kernels. 


4.6  Conclusion 

We  proposed  an  approach  to  deriving  adaptive  shift-invariant  image  representation.  Our  method 
is  computationally  very  efficient  and  eliminates  kernel  size  constraints,  which  can  lead  to  a  gen¬ 
eral  adaptive  multiscale  dictionary.  In  the  kernel  update  step,  we  focused  on  what  we  believe 
to  be  an  under-explored  family  of  algorithms,  that  exploit  the  structure  of  the  least-squares  opti¬ 
mization  problem. 
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Chapter  5 

Robust  Coding 


5.1  Introduction 


Reliable  communication  over  noisy  channels  is  the  fundamental  problem  of  information  theory. 
The  abundance  of  practical  modern  applications,  such  as  the  communication,  compression,  or 
storage  of  data,  has  produced  many  variations  on  the  theme.  In  this  chapter,  we  focus  on  the 
problem  of  finding  linear  representations  that  optimally  preserve  information  in  the  transmitted 


signals  when  the  representation  has  limited  precision.  Introduced  by  E.  Doi  and  coll,  in  [j43|] 


and  further  analyzed  in  [|44||,  the  so-called  Robust  Coding  scheme  makes  use  of  arbitrarily  many 
coding  units  to  minimize  reconstruction  error,  by  explicitly  introducing  redundancy  in  the  code 
to  compensate  for  channel  noise. 

The  above  problem  was  pointed  out  to  be  of  particular  relevance  to  a  new  and  exciting  area: 
mathematical  modeling  of  neural  representations.  This  is  not  at  all  surprising;  cells  can  be  re¬ 
garded  as  communication  channels  for  traveling  neural  spikes  and  their  coding  precision  is  lim¬ 
ited  by  intrinsic  biological  constraints  (see  for  instance  H  [T§,  [43].  f46|]).  By  identifying  the  short 
time  activity  of  a  neuron  with  a  real  value,  the  limited  information  capacity  of  the  encoding  unit 
can  be  modeled  effectively  by  additive  Gaussian  noise.  This  abstraction  has  been  observed  to 
be  better  suited  for  neural  modeling  than  are  noise-free  representations,  employed  by  existing 
standard  linear  models  like  PCA  or  ICA;  nevertheless,  in  this  chapter  we  intend  to  focus  on 
theoretical  optimality,  rather  than  on  biological  plausibility  issues. 

Many  problems  tightly  related  to  Robust  Coding  have  been  studied  in  the  literature.  In  spite 
of  the  high  conceptual  similarity  of  these  problems,  optimal  solutions  depend  on  the  various  ob¬ 
jectives  and,  just  as  importantly,  on  the  particular  assumptions  or  constraints  involved.  These 
factors,  as  it  turns  out,  determine  both  the  structural  properties  of  the  solutions  and  the  com¬ 
putational  cost  of  obtaining  them.  One  instance  of  such  problems  appears  in  the  context  of 


communicating  real- valued  signals  over  parallel  independent  Gaussian  channels  ([[33|,  ch.  9]). 
There,  the  objective  is  optimal  power  allocation  for  maximizing  mutual  information  among  the 
altered  components  of  the  signal,  subject  to  an  average  power  constraint.  The  solution  essen¬ 
tially  depends  on  the  covariance  of  the  original  signal,  by  the  so  called  “water-filling”  procedure. 
This  problem  addresses  only  one  aspect  of  the  coding  (power  allocation),  which  restricts  the  so- 
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lutions  to  the  set  of  diagonal  matrices  with  nonnegative  entries  and  bounded  trace.  Instead,  it 
would  be  more  useful  to  investigate  more  general  linear  transforms,  notably  those  implemented 
by  non-square  dense  matrices. 

The  idea  of  improving  robustness  to  additive  noise  via  redundant  linear  transformations  has 
been  addressed  in  the  context  of  frames  by  Daubechies  PI],  where  the  optimality  criterion  was 
mean  squared  error  (MSE).  Several  classes  of  frames  have  been  identified  as  optimal  solutions  of 
other  problems  involving  the  design  of  linear  representations  that  are  resilient  to  various  types  of 
coefficient  alterations  (quantization  noise  [^],  erasures  |[57|.  p7|[),  or  that  have  optimal  numerical 
stability  of  reconstruction  [Q],  (For  a  recent  and  thorough  review  of  frame  properties  and  appli¬ 
cations,  we  recommend  Kovacevic  and  Chebira  [|72|,  |73|]).  In  general,  the  frame  design  problem 
can  become  rather  complex  and  often  times  algebraic  methods,  usually  employed  for  structural 
characterization,  must  be  complemented  by  numerical  methods,  as  argued  by  Dhillon  and  coll. 

Several  insightful  techniques  have  been  developed  in  search  for  accurate  characteriza¬ 
tion  and  interpretation  of  optimal  frame  representations  which  reduce  the  numerical  optimization 


burden  considerably  (see  Casazza  and  coll.  |[26|]). 

To  formally  define  our  problem,  let  us  consider  our  signal  to  be  made  of  samples  from  an 
TV-dimensional  zero-mean  data  distribution,  with  known  full-rank  covariance  matrix  Ex.  For  a 
given  number  M  of  communication  channels^],  we  shall  search  for  analysis  matrix  W  e  MMxJV 


and  synthesis  matrix  A  € 


p  NxM 


that  maximally  reduce  the  effect  of  additive  Gaussian  noise, 


independent  of  the  signal  and  having  the  same  power  aj  on  each  channel.  More  precisely,  our 
objective  is  to  minimize  the  reconstruction  MSEf]: 


tr  {(Ijv  -  AW)£X(I N  -  AW)T}  +  a] tr  {AAt} 


(5.1) 


The  signal  power  on  each  of  the  M  channels  also  may  be  assumed  to  be  identical  (call  it 
o'2),  via  a  simple  rescaling  of  W’s  rows.  This  implies  the  existence  of  a  common  signal-to-noise 
ratio  (SNR)  parameter  7 2  =  o'2 /o'2  >  0  to  characterize  the  precision,  and  consequently  the 
information  capacity  of  each  channel^.  In  the  following  we  will  try  to  simplify  the  optimization 
problem,  by  reducing  it  to  a  more  convenient  form,  a  first  step  of  which  is  to  eliminate  the 
correlation  in  the  signal.  By  the  eigendecomposition  of  the  covariance  matrix  £x,  we  obtain 
a  diagonal  spectrum  matrix  S2  with  positive  diagonal  entries,  and  an  orthogonal  matrix  E  6 
M.NxN,  such  that  £x  =  ES2ET.  By  the  changes  of  variable  T  =  WES/07,  and  B  =  ETA  we 
can  formulate  our  optimization  problem  as: 


min  IIBT-SII2  +  -MlBlI2 

B,T  MF  '  72M  MF 

s.t.  diag(TTr)  =  1M,\ 


(RCsimpie) 


'No  assumption  about  the  relation  between  M  and  N  is  explicitly  made.  We  shall  refer  the  case  M  <  TV  as 
undercomplete,  and  the  opposite  one  as  overcomplete. 

2Here  we  take  the  average  over  the  data,  as  well  as  over  the  noise. 

3 In  Cover  and  Thomas  [|33|],  information  capacity  is  defined  as  |  log (y2  +  1),  which  is  a  function  of  the  channel 
SNR  72. 
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The  new  variables  B  and  T  also  represent  synthesis  and  analysis  matrices,  while  the  capacity 
constraint  simply  translates  in  the  rows  of  T  being  unit-length  vectors  in  R  v . 

When  N  =  1  and  N  —  2,  a  complete  characterization  of  the  optimal  encoding/decoding  pair 
for  the  Robust  Coding  problem,  based  on  exhaustive  case  analysis,  was  presented  in  [[43].  [44]]. 
For  general  dimension  N,  numerical  solutions  were  obtained  via  gradient-based  optimization 
with  Lagrange  multipliers.  Although  such  solutions  are  perfectly  legitimate  for  many  optimiza¬ 
tion  problems,  we  could  not  provide  optimality  guarantees  as  precise  as  in  the  one-  and  two- 
dimensional  cases.  A  conjecture  on  the  error  function  lower  bound  was  given  in  [|44|| ,  yet  even 
though  the  result  was  strongly  supported  by  numerical  results,  a  proof  remained  out  of  reach. 

The  present  chapter  addresses  this  issue  by  extending  the  previous  analysis  to  the  most  gen¬ 
eral  case  (any  N  and  M).  Namely,  we  characterize  the  algebraic  structure  of  the  solutions  via 
singular  value  decomposition  (SVD;  see  [|56|,  |1 10|]).  We  also  prove  an  exact  formula  for  the  MSE 
lower  bound,  as  a  function  on  the  covariance  spectrum,  channel  SNR  72,  and  on  the  dimensions 
N  and  M.  This  exact,  therefore  tight,  bound  not  only  enables  us  to  easily  verify  optimality  of  a 
given  encoding/decoding  pair,  but  also  gives  insight  into  how  to  manipulate  the  parameters  ( e.g ., 
increase  the  number  of  units  M)  to  minimize  reconstruction  error.  Last  but  not  least,  structural 
characterization  leads  to  fast  and  direct  algorithms  for  obtaining  the  optimal  solutions,  which 
confers  an  immense  advantage  over  our  previous  approach.  Thus,  the  computation  becomes 
independent  on  the  notorious  step-size  sensitivity  of  gradient  methods,  both  from  numerical  sta¬ 
bility  and  from  computational  complexity  points  of  view. 

The  chapter  is  organized  as  follows.  In  subsection  [T2[,  we  identify  necessary  conditions  on 
the  optimal  T,  via  constraints  on  the  components  of  its  singular  value  decomposition  (SVD), 
and  compute  the  optimal  singular  values.  Also,  the  structure  of  the  right  singular  matrix  V  is 
described,  and  a  generic  procedure  for  computing  the  left  singular  matrix  U  is  analyzed.  Also, 
we  derive  the  closed-form  expression  of  the  lower  bound,  interpret  its  dependence  on  the  prob¬ 
lem  parameters,  and  illustrate  with  several  interesting  cases.  Section  [A4]  concludes  the  chapter 


by  summarizing  the  results,  by  comparing  our  study  with  related  approaches,  and  by  revealing 
several  directions  we  plan  to  explore  in  the  future. 


5.2  Robust  Coding  Solutions 

In  section  |5J]  we  provided  an  extended  motivation  for  the  Robust  Coding  framework.  Namely, 
it  models  the  problem  of  reliable  communication  on  parallel  Gaussian  channels  having  identical 
signal-to-noise  ratios.  Here,  we  address  the  properties  of  optimal  solutions  and  propose  efficient 
ways  of  computing  them. 

First,  we  shall  introduce  several  notions  that  we  intend  to  use  throughout  the  rest  of  this 
chapter.  We  will  show  how  to  effectively  reduce  the  parameter  search  space  to  the  set  of  unit-row, 
Af-by-Ar  matrices  T.  Then,  we  identify  necessary  conditions  for  the  optimality  of  such  matrices 
via  constraints  on  its  singular  vectors,  and  compute  the  singular  values  exactly.  This  will  enable 
us  to  exactly  describe  the  structure  of  the  right  singular  matrix  V,  and  identify  several  algorithms 
for  computing  the  left  singular  matrix  U.  Next,  we  shall  derive  the  closed-form  expression  of 
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Figure  5.1:  Image  coding  under  the  presence  of  channel  noise.  For  each  reconstruction  its  percent 
error  is  indicated,  (a)  Original  image,  (b)  PCA  (M=32)  with  noiseless  representation,  (c)  PCA 
(M=32)  with  1-bit  precision  code,  (d)  Robust  coding  (M=32)  with  1-bit  precision  code,  (e) 
Robust  coding  (M=64)  with  1-bit  precision  code,  (f)  Robust  coding  (M=5 12)  with  1-bit  precision 
code,  (g)  PCA  (M=64)  with  1-bit  precision  code,  (h)  ICA  (M=64)  with  1-bit  precision  code,  (i) 
Daubechies  9/7  wavelet  with  1-bit  precision  code. 


the  lower  bound,  interpret  its  dependence  on  the  problem  parameters,  and  illustrate  with  several 
interesting  cases. 

Definitions  and  notations 

Let  us  define  several  notions  we  shall  use  throughout  the  chapter. 
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Definition  2.  For  any  p  G  N*,  let  us  denote  Or,(W)  the  orthogonal  group  of  index  p,  i.e.,  the 
space  of  all  p  x  p  orthogonal  matrices. 

Definition  3.  For  any  p  G  N*  and  r  >  0,  letBp(0;r)  =  {x\x  G  Mp,  ||x||2  <  r}  the  p- dimensional 
zero-centered  unit  ball  and  <SP(0;  r)  =  dBp( 0;  r)  its  surface. 

Definition  4.  For  any  two  matrices  Mi  and  M2,  we  define  the  direct  sum  of  Mi  and  M2  as  the 
block-diagonal  matrix 

m2>  (5'2) 


Definition  5.  For  any  two  matrices  Mj  and  M2  of  the  same  size,  we  will  denote  by  Mi  o  M2 
their  Hadamard,  or  entry-wise  product. 

Definition  6.  Let  nbe  a  positive  integer.  For  any  permutation  r  of{  1,  2, ...,  n},  its  corresponding 
permutation  matrix  is  defined  as 


PM  =  j)tJ 


(5.3) 


where  5  is  Kronecker’s  symbol:  5ap  =  1  if  a  =  (3  and  0  otherwise. 

Remark.  Permutation  matrices  are  both  orthogonal  and  doubly-stochastic  /B2I /. 


We  will  now  show  how  to  further  simplify  the  Robust  Coding  optimization  problem,  as  well 
as  to  reduce  the  parameter  search  space.  For  the  sake  of  completeness,  let  us  review  the  assump¬ 
tions  and  conditions  imposed  on  the  parameters  appearing  in  the  simplified  form  flRCsimpic|). 


Let  M,N  G  N*,  7  >  0,  and  S  =  diag(si, ...,  sN),  where  s1  > 


generality.  As  explained  in  section 
the  cost  function 

£(B,T)  = 


we  want  to  find  B  G 


p  NxM 


|BT-S|||  +  — ||B 
T 


1 2 
I  Fi 


.  >  sN  >  0  without  loss  of 
T  G  RMx-Y  that  minimize 

(5.4) 


subject  to  diag(TTr)  =  1m, i-  We  can  regard  £  as  a  function  defined  on  RMN  x  5^(0;  1)M. 
This  is  possible,  because  the  constraint  diag(TTT)  =  1m, i  means  that  the  rows  of  T  are  unit- 
length,  Ar-dimcnsional  vectors  and  therefore  we  can  identify  the  set  of  feasible  matrices  T,  with 
5tv(0;  1)m  =  5at(0;  1)  x  . . .  x  5^(0;  1).  We  observe  that  for  any  fixed  T  G  5^(0;  l)^7,  function 
^T(-)  =  £f.  T)  is  a  quadratic  (therefore  continuously  differentiable)  function  of  the  entries  of 
B,  and  so  a  necessary  condition  on  B  to  minimize  £ t  can  be  expressed  in  matrix  form  as: 


Om,jv  =  ^t(B)  =  ^£(B,  T)  =  2  ((BT  -  S)  Tr  +  ^B^)  (5.5) 

which  further  implies  that 

B  ^IM  +  TTt^J  =  STt.  (5.6) 

Remark.  Matrices  ^  I m  +  TT  V  and  ^Iy  +  T  v  T  are  positive  definite  matrices  for  any  7  >  0 
and  T  G  Ma/xJV.  As  such,  they  are  invertible  and  det(4jlM  +  TTT)  >  0,  det^I^-fT7  T)  >  0. 
Consequently,  function  £T  has  a  unique  point  of  extremum,  namely 
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-1 


(5.7) 


Bt  =  ST7  M  +  TTT 

Since  St  •  RA/,V  — »■  M  is  bounded  below  by  0,  and  obviously  unbounded  above,  diis  extremum 
can  only  be  a  minimum.  Therefore  eq.  (|5.7|)  represents  a  necessary  and  sufficient  condition  for 
B  to  minimize  <5T. 

From  the  observation  above,  we  can  infer  (see  details  in  Appendix  |53|): 

S(B.  T)  >  £(Bt,  T)  =  tr  (s2(In  -  TT(^IM  +  ttt)_1T)^  (5.8) 

with  equality  if  and  only  if  B  =  BT.  This  allows  us  to  conclude  that  (B min,  T min)  is  a  min¬ 
imizing  pair  for  £  if  and  only  if  B min  =  BTmin  and  T min  is  minimizing  the  cost  function 

J-  :  £>at(0;  1)M  — >  M, 

F{T)  =  tr  (s2(IN  -  Tt(-^Im  +  TTt)_1T)^  =  tr  (S2(Ijv  +  72TrT)-1)  .  (5.9) 

where  the  last  equality  follows  by  applying  Sherman-Morrison- Woodbury  formula  ([[5^|,  p.50]; 
see  Appendix  |53]).  Since  T  is  a  continuous  function  defined  on  a  compact  set,  Weierstrass’s 
theorem  implies  that  it  reaches  its  minimum  on  its  domain,  which  means  that  there  exists  at  least 
one  such  optimal  matrix  T„im  and  moreover,  min  T  =  min£. 

T  B,T 

Let  us  notice  that  matrix  T  shall  serve  now  as  a  parametric  descriptor  of  our  optimal  analy¬ 
sis/synthesis  Robust  Coding  pair.  We  have  succeeded  in  effectively  reducing  the  search  space  to 
the  set  of  feasible  matrices  T.  Next,  we  need  to  describe  the  structure  of  function  T' s  minima. 

The  SVD  structure  of  Tmm 

In  this  section,  we  shall  identify  necessary  conditions  on  the  minimizers  of  fF,  after  which  we 
shall  point  out  which  of  these  are  also  sufficient.  For  this  purpose,  we  employ  the  singular  value 
decomposition  (SVD)  of  T.  Thus,  if  we  denote  K  =  min(M,  N),  there  exists  a  decomposition: 

T  =  U-E-Vt  (5.10) 

where  U  e  Om  (M),  V  e  (9tv(M),  and  S  e  MMy -v  a  diagonal  matrix  whose  diagonal  entries 
are,  without  loss  of  generality,  sorted  in  decreasing  order:  (T\>  . . .  >  crK  >  D.  Let  us  denote  by 
TtK  =  diag(cri, . . . ,  crK),  the  “reduced”,  square  version  of  S. 

Remark.  Matrices  T7  T  e  MjVxJV  and  TT7  e  MMxM  are  symmetric,  and  their  SVD  (the  same 
as  their  eigenvalue  decomposition )  is 

TtT  =  VE7SVt  =  V  (S^  ®  Ojv-ir)  VT,  (5.11) 

respectively 

TT7’  =  U£EtUt  =  U  (T?k  ©  0 m-k)  Ut.  (5.12) 
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Remark.  Since  T  E  Bjy( 0;  1)M,  a  necessary  condition  on  its  singular  values  is: 


K 


M  —  tv  (TTt)  =  tr  (Y2K  ®  0 m-k)  =  tr  Y2K  =  ^  cr2 


(5.13) 


i=  1 


Let  us  examine  now  the  cost  function.  By  algebraic  manipulations,  from  eq.  (|5T9|)  we  obtain 
(see  Appendix  |53|): 


F{T)  =  tr  S2V  (diag( 


'  1  +  7  2  of  1  +  72cr 


I n-k)  V7 


K 


(5.14) 


As  function  T  above  depends  only  on  and  V,  we  choose  to  refer  it  (by  a  slight  abuse  of 
notation)  as  J(Ex,  V).  Our  goal  now  became  to  characterize  the  minimizing  pairs  (YK.min,  Vmm) 
of  this  function.  First  of  all,  let  us  observe  that  for  any  Yk,  there  exists^: 


G(YK) 


min  jF(£/f,V) 

veOjvOR) 


(5.15) 


We  can  show  that  function  Q  is  invariant  to  the  ordering  of  the  entries  on  the  diagonal  of  its 
argument  (for  proof,  see  Appendix). 

Lemma  1.  For  any  permutation  matrix  P  E  RA  xA,  we  have  G(YK)  =  QiPY^P1 ). 

This  lemma  guarantees  that  condition  oy  >  . . .  >  <jk  imposed  on  the  diagonal  elements  of 
E k  does  not  restrict  the  generality  in  any  way. 

Remark.  From  the  conditions  imposed  so  far,  we  have 


0  <  i <  y - 5-2  ^ 

1  +  72a(  1  +  72ci2 

s2  >  s2  >  . . .  >  s2n  >  0. 


< 


1  +  72cr|: 


2  — 


<  1 


(5.16) 

(5.17) 


This  implies  that  the  two  diagonal  matrices  appearing  in  (|5.14|)  have  their  (positive)  diagonal 
elements  ordered  differently.  The  following  general  Lemma  will  help  us  find  the  exact  expression 
of  Q,  and  moreover  characterize  all  the  orthogonal  matrices  V  for  which  Q(YK)  =  T ( £  K ,  V). 
Lemma  2.  Letn  E  N*  a  positive  integer,  andF>A  =  diag  (ai,  a2,  •  •  • ,  an),  Db  =  diag  (6i,  62,  -  -  - ,  bn), 
two  diagonal  matrices  such  that  a±  >  . . .  >  an  >  0  and  hn  >  . .  .  >  h  \  >  0. 

n 

a)  Then,  min  tr  (D^VDbV1  )  =  tr  (D  iDb)  =  W  0*6*. 

VGO„(R)  v  ’ 

n  n 

b)  Let  Ti,  be  all  the  permutations  of  {1,  for  which  dibTk(i)  =  'ffafi,  1  <k< 

i= 1  i=l 

m.  Then\/  V  G  V  is  a  minimizer  of  tr  (DiVD/jV7 )  if  and  only  if  the  entry-wise 

product  Vo  V  is  a  convex  combination  of  the  permutation  matrices  P  (77),  1  <  k  <  m. 

Proof.  See  Appendix. 

4Function  T(Y,k,  •)  :  On  ^  M  is  continuous,  and  defined  on  a  compact  set. 


43 


The  above  lemma  guarantees  that  the  minimum  of  tr  (D/iVDbVt)  over  the  set  of  orthog¬ 
onal  matrices,  is  reached  for  the  identity  matrix.  This  minimizer  may  not  be  unique,  as  this 
property  should  depend  on  the  diagonal  values  as  well,  and  not  only  on  their  relative  order.  If  we 
substitute  by  S2,  and  D#  by  (IK  +  72£^)-1  ©  I n-k,  we  obtain: 


G(Vk)  =  min  r(i:K,v)  =  r(i:K,iN) 

VeOiv(R) 


N 


=  tr  (diag(s2, . . . ,  s2K)  ■  (Ir-  +  ry2H,2K)  X)  +  ^ 


i=K+ 1 


K 

1  +  7  2cr2 
2—1  ' 


N 

£ 

i=K+ 1 


(5.18) 

(5.19) 

(5.20) 


Since  function  Q  is  continuous,  we  can  guarantee  that  there  exists  a  minimizer  for  Q  on  the 
compact  set  of  all  diagonal  matrices  of  trace  M,  with  nonnegative  diagonal  entries.  Conse¬ 
quently,  ( £ K-min ■  V min)  is  a  minimizing  pair  for  T  if  and  only  if  £ K,min  is  minimizing  G,  and 
moreover,  min  T  =  min  Q. 

Sx,V _ 

From  eq.  (|5.20|),  we  observe  that  an  immediate  lower  bound  for  (/  ( £  /i )  is  the  sum  of  squares 
of  the  smallest  N  —  K  diagonal  entries  of  S.  Moreover,  optimizing  Q  is  equivalent  to  solving  the 
following  optimization  problem: 


{min 
s.t. 


GkW  1)  ■  ■  ■  )  &k) 

i= 1 

(Ti  >  .  .  .  >  (Jk  >  0 


where  we  denoted 


K 


Gk(cTi,  .  .  .  ,(Tk)  —  ^ 


2—1 


1  +  72cr2 


Fortunately,  this  problem  has  a  closed-form  solution. 


Theorem  5.2.1.  There  exists  1  <  R  <  K,  such  that  for  any  index  j  <  K, 


3 

E 

Si  >  .  t~1  „ ,  r  -v^  j  <  R. 


J  "  j  +  7 2M 

Then,  problem  (  P7|)  has  the  unique  solution 


R  (R  +  72M)  —  1,  ifl  <i<R 

i  sj 

if  R  +  1  <  i  <  K. 


Proof.  See  Appendix  pO| 


(Pi) 


(5.21) 


(5.22) 


(5.23) 
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The  above  theorem  provides  the  general  form  of  the  singular  values  of  Tmm.  At  a  closer 
examination  of  the  result,  we  observe  that  the  optimal  singular  values  depend  not  only  on  the 
dimensions  and  the  SNR  level,  but  also  on  the  concentration  of  the  eigenvalues  of  the  covariance 
matrix.  The  threshold  R,  the  index  of  the  smallest  nonzero  singular  value,  and  consequently 
the  rank  of  the  optimal  encoding  matrix  T  (or  W,  for  that  matter),  directly  reflects  a  degree  of 
spectral  concentration.  The  necessity  of  identifying  this  degree  would  easily  become  apparent 

K  K 

in  a  naive  attempt  to  minimize  the  sum  sf/(l  +  72af)  s.t.  °f  =  via  the  Cauchy- 

i—  1  i=  1 

Schwartz  inequality  (see  Appendix  |53|).  On  the  other  hand,  R  is  an  indicator  on  “how  much” 
and  ’’which  part  of  the  data  space  we  can  fill”  with  the  available  number  of  noisy  coding  units. 
It  therefore  determines  an  interesting  power  allocation  scheme,  somewhat  related  to  the  well- 
known  “waterfilling”  scheme.  Similarly,  the  subspaces,  or  directions,  lacking  significant  energy 
are  sacrificed  and  the  resources  are  devoted  to  the  more  important  ones. 


Finding  V,  the  right  singular  vector  matrix  of  Tmin 

In  the  proof  of  Lemma  [2]  we  used  Birkhoff’s  theorem  [[62].  p.  527]  stating  that  for  any  V  G  On, 
matrix  V  o  V  is  doubly  stochastic,  and  therefore  it  must  be  a  convex  combination  of  permu¬ 
tation  matrices.  Furthermore,  matrix  V  is  a  minimizer  for  ')  if  and  only  if  all  the 

permutation  matrices  just  mentioned  are  also  minimizers.  In  this  section,  we  will  address  the 
structural  characterization  of  the  matrices  minimizing  T(T, K,mim  •)>  which  are  actually  all  the 
possible  right  singular  matrices  of  T.mm .  As  in  the  previous  section,  we  shall  state  a  slightly 
more  general  result  (we  omit  the  proof). 

Lemma  3.  Let  n  efa positive  integer,  and  Da  =  diag  (ai,  ai,  ■  ■  ■ ,  an),  D b  =  diag  (bi,b2, . . . ,  bn), 
two  diagonal  matrices  such  that  >  . . .  >  an  >  0  and  bn  >  . . .  >  bi  >  0.  Let  us  denote  by 
(J2£  the  partition  of  {l, n}  determined  by  the  values  ai,a2,  ■  ■  ■ ,  an;  namely,  V  i  G  Tf  ,  j  G 
2 %  ,  i  <  j  we  have: 


(ai  =  aj  k\  —  kf)  and  ( a *  >  aj  <6>  k\  <  kf)  ■  (5.24) 

Let  f  be  the  partition  of  { 1, ...,  n},  determined  by  bi,  b2,  ■  ■  ■ ,  bn.  (In  the  definition  above,  we 
only  need  to  substitute  a’s  with  b’s,  and  ^  >  aj  with  bi  <  bj.)  Then  all  orthogonal  matrices 
V  =  (vtj)  x<l  ]<n  minimizing  the  function  trfD  i  VD /■; V7  )  share  the  same  support  structure 
determined  by  these  partitions: 

Vij  f  0  =>-  i  G  Tf  and  j  G  if,  andlf^lf  f  0.  (5.25) 


Consider  sequences  a  =  ( ai)i<i<jv  and  b  =  (&i)i<i<jv,  where 


CLi 


bi 


slfl<i<  N, 


1 


1+7  vr 

1, 


if  1  <  i  <  K 
if  K  +  1  <  i  <  N. 


(5.26) 


45 


(5.27) 


As  we  observed,  there  is  a  correspondence  between  contiguous  intervals  of  optimal  singular 
values  <7j  and  intervals  of  variances  s*.  This  correspondence  will  directly  influence  the  structure 
of  the  V  matrix.  We  shall  start  discussing  this  issue  by  means  of  observations  on  the  “optimal 
permutations”  P(r)  involved  in  the  convex  expansion  of  V  o  V.  Namely,  we  notice  that  such  a 

N  N 

permutation  t*  is  optimal  if  and  only  if  Y  a,bT(t)  =  Y  atb,,  which  is  equivalent  to  saying  that 

2=1  2=1 

the  ordering  of  the  values  of  b  remains  unchanged  via  permutation  r. 

In  the  following,  let  us  consider  the  partition  ( Jj)\<j<p+i  °f  { 1-  2, . . . ,  A'  },  where  Jp+\  = 
{K  +  !•••••  Ar},  and  the  rest  of  the  Jj'  s  defined  as  in  the  previous  section.  We  will  denote  bj 
the  (common)  value  of  //,  for  i  G  Jr  First,  let  us  notice  that  for  1  <  j  <  (3,  we  have  r{Jj)  = 
Jj,  that  is,  optimal  permutation  r  takes  interval  Jj  onto  itself.  An  immediate  consequence  of 
this  observation  is  that  all  the  corresponding  permutation  matrices  P(r)  have  a  block-diagonal 
structure,  of  whose  first  (5  —  1  diagonal  blocks  are  of  size  \Jj\  x  \Jj\.  In  turn,  this  will  imply 
that  matrix  V  o  V  has  this  property,  and  therefore  matrix  C  itself  will  have  the  same  structure. 
Namely,  we  showed  that  the  first  (3  —  1  blocks  of  C  are  matrices  from  0\jj |,  respectively.  What 
about  the  rest  of  Cl 

If  b0  =  1,  since  bp+ i  =  1  it  follows  that  however  we  permute  the  last  \Jp\  +  \Jp+\  elements 
of  b,  the  result  will  be  optimal.  To  conclude,  let  us  prove  that  any  orthogonal,  block-diagonal 


matrix  C  is  optimal. 

tr  )  =  tr  (D^i  ©  ...  ©  •  (V i  ©  . . .  ©  Vp)  (5.28) 

(Db,i  ©  ...  ©  D b,p)  •  (Vf  ©  ...  ©  Vj))  (5.29) 

=  tr  (D^iViD^iVf  ©  ...  ©  D^V^Db^VJ)  (5.30) 

Since  DB  j  =  bj  ■  Ij j.\,  1  <  j  <  (3,  and  DB>/J  =  b0  ■  for  each  j  we  have 

D,4jV jDejVj  =  bj  ■  DajVjVJ  =  bj  ■  D aj  =  D/ijDRj  (5.31) 

and  therefore  tr  (D^VD^V1 )  =  tr  (D  iD/j),  which  as  we  know  is  optimal.  Let  us  observe 
that  the  previous  analysis  works  almost  unaltered  for  the  case  when  N  <  M  (i.e.,  N  =  K ), 
regardless  of  the  value  of  bp.  (The  only  difference  is  that  Jp+\  =  0.) 


Let  us  analyze  the  case  when  bp  <  1.  If  N  >  M{=  K),  and  aM  >  om+i,  the  analysis  goes 
almost  exactly  as  before,  the  only  difference  being  in  the  structure  of  the  optimal  V.  Namely,  V 
would  be  block-diagonal,  having  (5  + 1  (orthogonal)  blocks  of  size  \J3\,  respectively.  An  interest¬ 
ing  case  occurs  when  N  >  M(—  K )  (i.e.,  we  are  in  the  “undercomplete  case”),  and  aM  =  «m+  i 
(i.e.,  all  the  singular  values  are  non-degenerate  and  M  “splits”  a  contiguous  interval  of  variance 
values).  We  can  assume,  without  losing  generality  that  (5  =  1,  otherwise  the  first  (5  —  1  blocks 
of  V  are  orthogonal,  as  before.  In  other  words,  our  problem  is  the  following:  given  a±  =  . . .  = 
ar  >  . . .  >  ajy  >  0,  and  b\  =  . . .  =  bp  <  bp+ 1  =  . . .  =  6jv  =  1,  with  r  >  p,  for  which  per- 

N  N 

mutations  r  of  the  indices  we  have  ai^r{i)  =  Yh  aA2  The  answer  is  quite  immediate,  namely 

2=1  2=1 

we  can  only  consider  permutations  for  which  the  smallest  p  values  of  b  are  paired  with  some  of 
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the  largest  values  values  of  a,  or  more  simply  stated  for  which  {1,2...,  p}  C  r  ({1,  2  . . . ,  r}). 
But  this  condition  is  equivalent  to  r  ({r  +  1, . . . ,  N})  p|  {1,  2  . . . ,  p}  =  0,  which  means  that  the 
lower  left  submatrix  of  P(r),  corresponding  to  rows  r  +  1, . . . ,  N  and  columns  1, . . . ,  p  is  null. 
But  then  V  o  V  and  V  will  also  have  this  property!  To  complete  the  analysis,  let  us  show  that, 
for  any  orthogonal  matrix  V  of  the  form 


V 


X  Y  \ 
0  Z  ) 


(5.32) 


we  have  tr  (D^VD^V7 )  =  tr  (D^Ds)  (here  X  e  Mrxp).  First,  let  us  observe  that  the  orthog¬ 
onality  of  V  implies 

XTX  =  IP,  XXT  +  YYT  =  Ir,  and  ZZT  =  IN_r.  (5.33) 


We  have: 

tr  (D^VDbV7)  =  tr  (  Da 


=  tr  (  Da 
=  tr  (  Da 


X  Y 

0  z 

bpX  Y 

0  z 


bp\ 


Itv-p 

XT  0 


XT 

Yr 


0 

ZT 


=  tr  D 


bpXXJ  +  YYt  YZt 
ZYt  ZZt 

(bp-l)XXT  YZt 

ZYt  0 


+  D. 


=  tr  Da  —  (1  —  bp)  tr  (DA/3XXt) 

Since  a\  —  . . .  —  ar,  and  r  >  p,  it  follows  that  DA)(g  =  ailr,  and  so 

tr(D.4VDBVr)  =  trDA-(l-^)tr(DAi/3XXr) 
=  trDA-a1(l-fy3)tr(XXT) 

=  tr  Da  —  ai(l  —'bp)  tr  (XTX) 

=  tr  Da  —  ai(l  —'bp)  tr  (Ip) 


N 

a,  -pai(l  -  bp) 

i— 1 
V 

-pai(l  -  bp) 

i=l 


Y 

i=l 


(1  -  bp) 


N 

yy  a* 

i=p+ 1 
N 

F  ^  a, 

i=p+l 


N 


bp  yy  <ii  +  yy  ^ 

i= 1  i=p-\- 1 

tr  (DaDb) 


(5.34) 

(5.35) 

(5.36) 

(5.37) 

(5.38) 

(5.39) 

(5.40) 

(5.41) 

(5.42) 

(5.43) 

(5.44) 

(5.45) 

(5.46) 

(5.47) 
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Algorithm  1  Compute  U,  the  left  singular  matrix  of  Tmin. 

Let  U()  e  Om  arbitrary. 

Q  4=  Uo  ( ^K,min  ©  ®M~k) 

for  i  =  1  to  M  —  1  do 

Find  indices  j,  k  such  that  (1  —  q:j] )  ( 1  —  qkk)  <  0;  otherwise,  stop. 
Let  6i  e  [0,  27t),  such  that  q'--  =  1,  where  Q'  =  G Q  G j,k{^i)T 

Ui  4=  Gj)fc(0,;)Uj_1 

Q  -<=  Q; 

end  for 

Output  U  =  U, 


In  conclusion,  in  the  case  when  N  >  M  and  =  aM+ 1,  matrix  V  is  block-diagonal,  with 
the  first  (3  —  1  blocks  orthogonal  matrices,  and  the  / 3th  block  of  the  form  shown  in  eq.  (]5.32[).  Our 
attempt  to  characterize  the  right  singular  vector  matrices  of  Tmm  is  now  complete. 

Finding  U,  the  left  singular  vector  matrix  of  Tmin 

In  this  section  we  shall  describe  how  to  compute  the  left  singular  matrix  of  Tmm  when  we  know 
the  optimal  singular  values  (i.e.,  HKmin).  As  we  noticed  already,  this  is  needed  only  to  satisfy 
the  constraint  and  does  not  influence  the  cost  function.  Namely,  our  goal  will  be  to  search  for 

U  G  0m(K),  such  that  diag  (U  (E^min  ©  0M-rr)UT)  =  lM,i- 
Remark.  For  any  U  e  we  have: 

tr  (U  rslmin  ffi  0„_k)Ut)  =  tr  (TTr)  =  M  (5.48) 

Using  this  observation,  we  shall  provide  a  very  simple  and  efficient  way  to  obtain  the  U 
matrix,  based  on  Givens  rotations.  We  illustrate  this  procedure  (hereby  referred  as  Algorithm 
i©2j,).  and  then  prove  its  correctness. 

Lemma  4.  Algorithm  (572]  computes  a  matrix  U  G  Om,  satisfying  the  constraint  condition 

diag  (u  ©  0m-k)3Jt )  =  1m,  1-  (5.49) 

Proof.  See  Appendix. 

As  it  can  easily  be  observed,  there  may  be  multiple  ways  to  “fix”  the  initial  matrix  Uo  into 
an  acceptable  solution.  Let  us  also  remark  that  the  procedure  just  described  remains  virtually 
unchanged  if  we  work  with  the  incomplete  SVD,  rather  than  with  full  SVD,  in  the  overcomplete 
case  (M  >  N).  The  only  difference  is  in  the  choice  of  the  starting  point  Uo  G  MJ'/xAr,  having 
an  orthonormal  set  of  columns.  Unfortunately,  this  observation  does  not  essentially  reduce  the 
computational  complexity  of  finding  matrix  U,  which  is  still  going  to  be  0(M2). 

Error  Bound 

It  is  useful  to  know  what  is  the  theoretical  limit  of  the  cost  function.  To  find  the  exact  formula  for 
the  lower  bound,  we  first  need  to  identify  the  rank  R  of  the  encoding  matrix,  that  is  we  need  to 
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Figure  5.2:  Robust  Coding  solutions:  computed  without  additional  constraints  (left),  with  ”spar- 
sity”  constraints  (center),  and  with  “locality”  constraints  (right).  [Courtesy  of  E.  Doi.] 


check  until  when  does  the  non-degeneracy  condition  hold.  To  do  this,  binary  search  is  sufficient, 
and  so  we  can  find  R  in  0(log  M)  time.  Once  it  is  known,  we  can  express  the  closed-form  lower 
bound  of  £,  as  demonstrated  by  the  following  lemma. 

Lemma  5.  The  minimal  value  of  the  cost  function  £  is 


min  £ 


R  +  72M 


N 

+ 

i=i?+l 


(5.50) 


Proof.  See  Appendix. 


This  result  proves,  as  well  as  improves  our  conjecture  in  [|44|]. 


5.3  Robust  Coding  Algorithms 

We  have  described  the  optimal  solutions  of  the  Robust  Coding  problem  when  the  only  constraint 
was  that  the  SNR  of  the  channels  was  identical,  arguing  that  this  should  lead  to  a  more  realistic 
model  for  neural  encoding.  As  we  pointed  out,  any  solution  of  the  form  described  in  section  |A~2|  is 
equally  good,  although  it  is  not  obvious  what  kind  of  image  structure  do  these  optimal  solutions 
represent  (see  figure  |5T2|). 

The  most  natural  explanation  for  this  lack  of  structure  is  the  fact  that  we  only  focused  on  one 
biological  constraint  that  accompany  neural  encoding.  Other  constraints,  such  as  the  sparsity  of 
the  distribution  of  encoding  coefficients,  known  to  hold  in  real  biological  systems,  should  also 
be  taken  into  account. 
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This  motivation  has  lead  to  more  structured  optimal  RC  solutions  in  [[f7|].  There,  by  extending 
the  RC  model  to  account  for  optical  blur,  as  well  as  for  sparsity  of  the  neural  activity,  they 
managed  to  produce  optimal  solutions  closely  resembling  receptive  fields  of  retinal  ganglion 
cells.  For  the  computation,  they  used  gradient  descent  with  Lagrange  multipliers,  thus  grouping 
all  the  constraints  and  the  objective  function.  Although  the  optimization  converged,  the  number 
of  iterations  was  considerable,  even  for  a  relatively  moderate-size  problem.  We  will  focus  more 
on  the  computational  side  of  their  problem.  Namely,  we  are  interested  in  deriving  algorithms  that 
take  advantage  of  the  known  structure  of  the  Robust  Coding  optimal  solutions  to  accommodate 
additional  constraints.  The  obvious  advantage  of  such  an  approach  is  the  significant  reduction  of 
the  (structured)  search  space. 

In  our  case,  since  the  only  set  of  parameters  needed  to  differentiate  among  all  RC  solutions  is 
the  set  of  left  singular  vectors  (an  orthonormal  matrix),  we  propose  using  optimization  algorithms 
on  Stiefel  and  Grassmann  manifolds  (for  a  comprehensive  description  and  categorization  of  most 
widely  used  algorithms  of  this  kind,  see  [j?|]).  The  idea  of  using  a  “natural  gradient”  (as  it  is 
sometimes  referred)  has  lead  to  what  is  currently  state-of-the-art  stochastic  gradient  approach  to 
ICA  ([]30|,  [66jl ),  but  has  been  applied  to  many  other  problems  (see  [[7]]  for  a  thorough  review,  and 
|[|]  for  a  very  good  monograph). 

We  recently  found  that  variants  of  the  (unconstrained)  Robust  Coding  problem  have  been 
known  already  in  the  information  theory  literature.  The  most  similar  to  our  analysis  is  the  one 
presented  in  [|77|,  [7Kj].  Our  structural  characterization  of  the  optimal  encoder  matrices  is  (in  our 
opinion)  more  explicit  and  elegant.  Regarding  the  algorithm  for  computation  of  the  left  sin¬ 
gular  matrix  U,  we  learned  that  it  has  been  rediscovered  at  least  two  more  times  since  [|78|], 
being  included  in  [|T5l]  and  |[27]].  Fortunately,  we  can  now  take  advantage  of  recent  mathemat¬ 
ical  advances  such  as  optimization  algorithms  on  manifolds  to  identify  algorithms  with  better 
computational  and  numerical  properties.  Applicability  of  the  general  theory  of  optimization  on 
Stiefel  and  Grassmann  manifolds  was  eased  significantly  by  the  implementation  and  release  of 
the  sg  MATLAB  package  (see  [[82|]).  We  feel  that  by  adapting  such  power  to  the  specifics  of  Ro¬ 
bust  Coding  (and  its  generalizations)  we  can  advance  our  understanding  about  this  fundamental 
problem. 


5.4  Discussion 

We  have  provided  a  theoretical  analysis  of  Robust  Coding  solutions  in  the  general  case:  arbitrary 
dimension,  arbitrary  number  of  encoding  units,  arbitrary  precision,  and  spectrum  of  the  data 
covariance  matrix. 

A  consequence  of  our  result  is  that  we  can  manipulate  the  parameters  to  arbitrarily  reduce 
the  error.  For  instance,  one  possible  intuitive  interpretation  of  the  formula  is  that  by  increasing 
the  redundancy  (i.e.,  M),  we  can  compensate  for  the  limited  coding  precision  and  thus  we  can 
overcome  the  effect  of  noise  in  the  representation.  Alternatively,  for  a  fixed  number  of  encoding 
units  a  higher  capacity  will  result  in  a  lower  erroif], 

5  We  should  emphasize  that  the  “break”  index  J  is  dependent  of  both  M  and  7.  However  its  effect  does  not  affect 
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By  keeping  our  model  flexible  enough  to  handle  arbitrary  (i.e.,  non-isotropic)  covariance  also 
implies  that  Robust  Coding  optimal  encoder/decoder  are  not  necessarily  tight  frames.  Indeed, 
because  of  the  way  index  R  is  defined  we  notice  that  the  optimal  matrices  may  not  have  full 
rank.  This  is  to  say  that  we  first  select  a  subspace  most  relevant  for  the  data  distribution,  and 
then  allocate  all  the  representation  resources  (encoding  vectors)  to  span  this  subspace  as  well  as 
possible.  Such  an  interpretation  is  related  to  the  one  in  [p6|],  however  it  is  more  general. 


5.5  Appendix 

Reformulate  cost  function  8.  We  show  how  the  cost  function  is  transformed  when  we 
plug  in  the  expression  of  the  optimal  decoding  matrix  (BT  from  eq.  (5771))  into  eq.  (|574|): 


£(Bt,T)  =  ||BtT-S||^  +  —  ||Bt||| 


7 


tr  ((BtT  -  S)(BtT  -  S)T)  +  4tr  (BtB£) 


7 


=  tr  I  Bt(— Im  +  T'T  )B4  +  S' -  STj  B^,  -  BtTS 

=  tr  (STrB^  +  S2  -  STrB^  -  BtTS) 

=  tr  ((S  —  BtT)S) 

=  tr  (S(S  —  BtT)) 

tr  (s(S  -  STt(-^Im  +  TTt)-1T 


7 


=  tr  S2(I/v  —  Tr(— I 


7 


TT 


T\- lr 


(A-l) 

) 

(A-2) 

(A- 3) 

BtTS^) 

(A-4) 

(A-5) 

(A-6) 

(A-7) 

(A-8) 

(A-9) 

Sherman-Morrison- Woodbury  formula. 

Proposition  1  (|]56(1).  Let  X,  Y  e  C  e  RN,  such  that  both  C  and  1M  +  YrC“1X  are 

nonsingular.  Then  (C  +  XY1 )  is  also  nonsingular  and 

(c  +  xyt)_1  =  cr1  -c~1x(im  +  ytc~1x)~1ytc-1.  (a-io) 

By  plugging  C  =  In,  and  X  =  Y7  =  yT7  into  the  identity  above,  we  obtain: 

(I^  +  72TtT)-1  =  IJV-TT(4lM  +  TTr)”1T.  (A- 1 1 ) 

7 

consistency  of  the  interpretations. 
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Reformulate  cost  function  T. 


^(T)  =  tr  (S2(Ijv  +  72TtT)-1) 

=  tr  (S2(VVT  +  72V  (£2^  ®  0N_K)  Vt)-1) 
=  tr  (S2V(I*  +  72(£2f  ©  (W)))-1^) 

=  tr  (S2V  ((I*  +  7 2SiO_1  ©  I n-k)  Vt) 


=  tr  S^V  (diag( 


1  +  i1o\  ’  "  ’  ’  1  +  72cr| 


In-k)  Vj 


K 


(A- 12) 
(A- 13) 
(A- 14) 
(A- 15) 

(A- 16) 


Proof  of  Lemma  jl|.  Fix  arbitrary  permutation  matrix  P,  and  define  matrix  Pjv  =  P©Ijv-ir- 
Then, 


g(P£A-PT)=  min  ^(P£aPt,V)=  min  TfSjf.VPJ 

VgOjv(R)  VeOjv(R) 


min  .F(£*,V')  =  0(£*)- 

V'gOjv(R) 

(A- 17) 


Proof  of  Lemma  |2j.  Due  to  the  inverse  ordering  of  elements  on  the  two  diagonals,  by  Hardy- 
Littlewood-Polya  rearrangement  lemma  we  know  that  for  any  permutation  7r  of  {1,  2, . . . ,  n}  we 
have: 

n  n 

a  A .  (A- 18) 

i— 1  i=  1 

Let  us  inspect  the  function  we  intend  to  optimize: 

n  n 

tr  (DaVDbVt)  =  tr  (Df  VDflVTDf )  =  ||Df  VD^/2|||  = 

i= 1  j= 1 

=  aT(V  o  V)b 


where  a  =  (al7 . . . ,  an)T ,  b  =  (&i, . . . ,  bn)T . 

Remark.  For  any  orthogonal  matrix  V  =  (v,f)  e  On(W),  the  Hadamard product  Vo  V  =  (v2) 
is  a  doubly  stochastic  matrix.  This  property  is  simply  a  restatement  of  the  fact  that  the  rows  and 
columns  of\  are  unit  length  vectors. 

Birkhoff’s  theorem  [|62|,  p.  527]  states  that  for  any  n  x  n  doubly  stochastic  matrix  Q  (in 
particular,  for  Q  =  V  o  V),  we  can  write  Q  as  convex  combination  of  permutation  matrices. 

m 

In  other  words,  there  exist  m  e  N*,  nonnegative  coefficients  07, . . . ,  am  with  ak  =  1,  and 

k= 1 

m 

permutations  Ti, . . . ,  rm  of  {1,  2, . . . ,  n)  such  that  Q  =  0^(77).  Consequently, 

k= 1 


tr  (DaVDbVt)  =  aT  ^«fcP(rfc)  b  =  ^  [afc  (arP(rfc)b)]  =  ^ 


,k=  1 


k= 1 


k= 1 


Oik 


m 


>-Y. 


n 

22 aibi  =  tr  (DaDb)  . 

i=l 
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The  cost  is  minimized  for  those  and  only  for  those  orthogonal  matrices  V  such  that  V  o  V  is  a 

n  n 

convex  combination  of  matrices  corresponding  to  permutations  r  for  which  atbTy)  =  aA  - 

i=  1  i= 1 

Consequently,  we  obtain  the  desired  result. 


Proof  of  Theorem  |5.2.1|.  Let  J  =  K.  Let  us  characterize  the  ensembles  (cr,:)  l<i<J  that 


minimize 


Gj(cri, . . . ,  <Jj)  =  Y 


Tt  1  +  72°f 


(A- 19) 


j 

subject  to  of  =  M ,  and  oy  >  . . .  >  aj  >  0. 

1=1 

If  J  =  1,  then  the  solution  is  obvious:  there  is  only  one  nonnegative  value  (namely  oy  = 
\/M)  satisfying  the  constraint  and  therefore  it  will  also  be  optimal.  If  J  >  1,  we  need  a  more 
elaborate  way  of  analysis  and  for  this  purpose  we  shall  use  an  auxiliary  lemma  which  will  allow 
us  to  relax  the  constraints,  without  critically  influencing  the  optimal  solutions. 

Lemma.  With  the  above  notation,  we  have: 


min  <3j{(7\,  . . . ,  oj)  =  min  £j(oy, . . . ,  a  j)  (A-20) 

E  <y}=M  E 

i= 1  i=  1 

Proof.  It  is  sufficient  to  prove  that  the  minimum  value  on  the  left-hand  side  (i.e.,  on  the  more 
restricted  domain )  is  no  larger  than  the  one  on  the  right-hand  side.  First,  let  us  observe  that  Qj 
is  a  continuous  function.  If  we  define  this  function  on  the  compact  Sj(  0;  \/M)  (the  surface  of  a 
ball),  Qj  reaches  its  minimum  at  a  point  (x] . . . . ,  xj)T  G  Sj( 0;  s/M).  But  since  Qj  is  even  in 
each  of  its  arguments,  we  have 

Qj(xi, . .  ■ ,  xj)  =  Gj(\x1\,...,\xj\)  (A-21) 


and  so  we  can  assume  without  losing  generality  that  xt  >  0,  1  <  i  <  J. 

Moreover,  Qj(x\ , . . . ,  xf)  >  Q j(xT(i), . . . ,  xr(j))  with  equality  if  xT(i)  >  ...  >  xT(jy  This 
means  that  we  can  also  assume  without  losing  generality  that  X\  >  ...  >  xj  >  0.  Thus  the 
lemma  is  proved. 

We  transformed  our  problem  into  minimizing  a  continuous  function  on  a  less  restricted  do¬ 
main,  which  will  help  us  describe  the  optimal  ensembles  in  a  simple  fashion.  We  will  employ 
a  slightly  different  parametrization  of  Qj,  which  not  only  embeds  the  constraint,  but  also  allows 
us  to  optimize  a  continuously  differentiable  function  defined  on  a  closed  ball  (as  opposed  to  the 
surface  of  such  a  ball). 

Namely,  let  TLj  :  £?./_ i(0;  s/M)  — >  (0,  oo), 


j- i 

HJ(a1,...,aJ- 1)  =  Y 

i= 1 


1  +  7  V2 


+ 


1  +  7 2  (  M  -  Y=i  T 


(A-22) 
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which  is  a  continuous  function  on  the  compact  Bj-\  (0:  y/M),  and  therefore  has  at  least  a  mini¬ 
mum  point  (E)  1<?;<J  on  this  set.  Such  a  point  can  lie  either  on  the  border  of  the  ball,  or  on  the 

j- 1 

interior.  The  first  case  implies  that  E  &j  —  and  so  a  j  =  0.  The  problem  is  then  reduced  to 

i—  1 


J- 1 

finding  the  minimum  of  the  analogously  defined  cost  function  Gj-i,  subject  to  E  ai  —  M. 

i= 1 

As  the  sum  of  of ’s  is  M,  there  must  exist  an  index  i  for  which  dy  7^  0.  We  can  assume 
without  losing  generality  that  J  is  the  largest  such  index.  As  noticed  before,  if  J  =  1,  we  are 
done.  Otherwise,  the  minimum  point  (oh, . . . ,  ctj- i)  of  7ij  lies  on  the  interior  of  B j_i(0;  VM). 
From  Lemma  |5.5|,  we  can  assume  that  the  optimal  ensemble  satisfies  o\  >  . . .  >  o j  >  0. 

Since  Tt  j  is  differentiable  on  the  interior  of  its  domain,  it  follows  that  for  all  1  <  i  <  J: 


dHj 

dai 


[1  +  72bf]2 

( 


+ 


2crj72Sj 


j-i 


7=2 


1  +  72  [  M  - 


3= 1 


-2o-i72 


[1  +  72cL2f 


V 


J- 1 


1  +  72  I  M  -  E 


3= 1 


\ 


and  the  gradient  is  null  in  (a1, . . . ,  crj_i).  We  obtain 


(A-23) 


(A-24) 


[1  +  72df]2 


j-i 


1  +  72  I  M  -  Y, 


3  = 1 


[1  +  7  M]2 


,  V  1  <  i  <  J 


and  consequently 


(A-25) 


j 

E  si 

i=l 


J 

E 

i— 1 


1  +  72df 


J  +  72  E  a 


^2 

i 


J  +  72M 


,  V  1  <  i  <  J 


i=  1 


—j— —  ( J  +  72M)  -  1,  V  1  <  i  <  J 

E 


(A-26) 

(A-27) 


We  observe  that  the  existence  of  an  interior  minimum  point  depends  on  how  — ^ —  ( J  +  72M) 

Li=l  si 

compares  to  1.  This  is  the  same  as  saying  that,  in  order  for  such  a  solution  to  exist,  not  even  the 
smallest  among  the  st  should  be  too  much  smaller  than  their  “average’’^. 

6  We  used  quotes  here  because  the  lower  bound  is  strictly  smaller  than  the  actual  average. 
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(A-28) 


Let  us  denote  by  R  the  largest  index  J  <  K  such  that 

sj 


Ei= i 


(J  +  72M)  >  1. 


Such  an  index  indeed  exists,  as  for  J  =  1  the  above  condition  is  satisfied,  and  the  set  {1, . . . ,  K} 

is  finite.  We  shall  prove  now  that  for  any  index  J  <  R,  it  too  satisfies  condition  (|A-28|). 

j 

Denote  tj  =  E  1  —  2  —  AT.  As  observed  before,  the  case  J  =  1  is  clear.  Assume  J  >  1. 

1=1 

Then  condition  (|A-28|)  is  equivalent  to 

tj  . ,  ,  ,  ,  tj  , .  /(  ,  \  /  T  ,  2 


sj  > 


tj  tj— i  > 


J  +  7  2M 

<s=>  tj(J  —  1  +  7 2M)  >  t j-i(J  +  72M) 


J  +  72M 

2j/\  ,,  tj 


O  (tj  -  tj- i)(J  +  7  M)  >  tj 
tj-i 


> 


J  +  72M  J-1  +  72M' 


But  then,  since  the  s,  values  were  ordered  in  decreasing  order,  we  have: 


sj— i  >  sj  > 


tj 


> 


tj-i 


J  +  72M  J  -  1  +  7 2M 


which  proves  that  if  J  satisfies  (|A-28|),  then  so  does  J  —  1.  Thus,  we  showed  the  existence  of  an 
index  R  as  in  the  statement  of  the  theorem. 

It  follows  immediately  that  the  optimal  values  a*  are 


,  +  ifl  <i<R 

ai  —  {  7  \  tR 


(A-29) 


0, 


if  R  +  1  <  i  <  K. 


We  can  easily  verify  that  both  constraints  on  the  <7*  are  verified  by  the  values  above,  and  thus  the 
theorem  is  completely  proved. 


K  K 

Cauchy-Schwartz  fails.  Let  us  try  to  minimize  E  s?/(l  +  72of )  s-h  E  af 

i= 1  i=l 

the  Cauchy-Schwartz  inequality: 


M,  using 


(A-30) 


with  equality  if  and  only  if  either  all  6,;’s  are  zero,  or  if  there  exists  a  real  constant  p,  such  that 
cii  —  pbi,  VI  <  i  <  n.  In  our  case,  this  becomes 


k  2 
S7 


K 


K 


E 


1  +  72a2 

=  1 


EP  +  tV?)  >  [-£. 


,  i=l 


,  i=l 


K  2 

sj 


E 


E  (i  + 


> 


2=1 


AT  +  72M 


i  1  +  72°f  / 


K  \  2 

Esi) 

Es> 

.*=1  / 

\*=1 
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2  • 


(A-31) 


(A-32) 


We  may  be  tempted  to  say  that  we  have  equality  in  the  relation  above,  if  and  only  if: 


.  =ct.,  VI  <i  <K.  (A-33) 

V1  +  7  2°i 

Unfortunately,  this  is  not  always  true,  as  this  condition  would  necessarily  imply  the  condition 
in  eq.  (|A-27|).  Nor  should  it  be  so,  since  we  are  overconstraining  the  quantities  involved  in  the 
Cauchy-Schwartz  inequality  (namely,  we  impose  that  all  \bt\  be  supraunitary). 


Proof  of  Correctness  for  Algorithm  |572| 

First,  we  need  to  prove  that  it  is  indeed  possible  to  construct  the  matrices  Uz  mentioned 
above.  Assume  we  are  at  the  ith  step  in  the  loop  (1  <  i  <  M  —  1).  Namely,  we  know  that  the 
first  i  —  1  diagonal  entries  of  matrix  Q,  i  =  Uj_i . . .  U0(£^min©OM-ic)Uo  •  •  •  are  equal 
to  1.  We  will  search  for  the  orthogonal  matrix  U,;  such  that  the  first  i  —  1  diagonal  entries  of 
Qi  =  U,Q,._|Uf  remain  unchanged,  while  the  ith  entry  becomes  1.  We  will  restrict  our  search 
to  orthogonal  matrices  of  a  particular  form,  namely  to  Givens  rotations: 


\ 


cos  9 

sin  6 

—  sin  6  ■  ■ 

■  cos  9 

\ 


l) 


(A-34) 


where  1  <  a  <  (3  <  M,  and  9  e  [0,  ^ r).  In  our  particular  case,  we  shall  take  U,:  =  G*  ^(6^)  and 
we  show  how  to  find  parameters  3i  and  6,  such  that  the  invariant  is  satisfied. 

To  simplify  the  description,  let  us  denote  Q;  =  (  q-l j  _ .  Then,  by  assumption  we 

V  /  j,k=l,M 

have  qfj  =  1,  V  1  <  j  <  i.  Let  us  observe  that  due  to  the  particular  form  of  U,,  we  have 
q(j‘i  =  (Ijj  1,5  7  1  <  j  <  M,  ?  /  j  /  3r.  In  the  simplest  case  (namely  if  q\l~^  =  1),  we  take 
6i  =  0,  a  =  i  and  (3  =  i  +  1,  which  gives  U*  =  IM.  Then  q:^  —  1,  V  1  <  j  <  i.  Now,  assume 

M 

without  loss  of  generality  that  <  1.  Since  M  =  tr  Q,_i  =  W  q  1  -1 ,  it  follows  that  there 

3= 1 

exists  an  index  t  +  1  <  3t  <  M,  such  that  >  1.  We  will  search  now  for  0,  such  that 
q^  =  1.  Let  g  e  MM  be  defined  by 


9j 


cos  9i,  if  j  =  i 
sin  0i,  if  j  =  3i 
0,  otherwise. 
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(A-35) 


(A-36) 


(In  other  words,  vector  g  is  the  transposed  of  the  z1h  row  of  matrix  U,.)  Then: 

Qii  =  grQ*-ig 

(<-l)  1) 


=  (  ^ >  (  &  It  )  (  ^  ) 

-  (  COS0-  sin  0)(  ^cos^  +  ^sin^i 

-  (  cos  (7,  Sin J  (i_i)  (i-l)  .  Q 

=  ^r1}  cos2  +  2g^“1)  cos  Oi  sin  Oi  +  q ^  sin2  Oi 


(A-37) 


(A-38) 


(A-39) 


(0 


We  notice  that  sin  Oi  =  0  will  not  give  an  acceptable  solution,  since  then 
Therefore  we  can  assume  sin  0%  ^  0.  Then 
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(i-i) 


<  1. 
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=  Oil  cot2  Oi  +  Zqip.  cot  Oi  +  qpi/3. 

Now,  all  need  to  do  is  to  solve  a  quadratic  equation  in  t  =  cot  0, : 

(1  -  qtl)t2  -  2  q%ft  +  1  -  =  0 


(A-40) 

(A-41) 

(A-42) 
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Since  we  assumed  1  —  q}]  ’  >  0  and  1  —  g^./3,  ’  <  0,  it  follows  that  the  discriminant  is 
positive: 

(A-45) 
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Proof  of  Lemma  g  (Error  lower-bound).  Let  us  recall  that 


min  MSE  =  min  £  =  min  T  =  min  T  =  min  Q 

B.T  T  £k,V  £k 


(A-46) 


where  in  each  case  we  assumed  the  appropriate  constraint.  Due  to  eq.  (|5.20D  and  to  the  formula 


fl5.23|)  on  the  optimal  singular  values,  we  have: 
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This  gives  an  exact  formula  for  the  lower  bound  of  the  error  function. 
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Chapter  6 
Conclusions 


We  have  studied  several  ways  in  which  related  existing  frameworks  for  signal  representations, 
particularly  those  pertaining  to  visual  signal  encoding,  can  be  extended  and  improved.  Since  the 
description  of  a  signal  or  of  an  entire  class  of  signals  is  intrinsically  suboptimal  if  it  does  not 
account  for  statistical  properties  of  that  class,  we  embrace  the  adaptive  encoding  point  of  view 
as  essential  to  an  optimal  design.  The  main  challenge  is  that  in  so  many  cases  adaptivity  is  too 
impractical  to  employ,  and  therefore  trading  optimality  for  computation  might  not  necessarily 
be  worth  it.  We  beg  to  differ.  As  such,  we  suggested  several  ways  in  which  the  particularity  of 
the  problem  can  be  exploited  to  allow  for  practical  adaptive  solutions.  In  each  of  the  instances 
hereby  studied,  we  follow  two  main  goals:  to  clearly  identify  the  theoretical  principles  which 
govern  the  representation’s  optimality,  and  to  convey  the  most  efficient  algorithm  to  compute  it. 

Multiresolution  ICA.  We  designed  and  implemented  a  hybrid  multiresolution  adaptive  method 
(MrICA)  for  image  encoding.  We  demonstrated  that  it  combines  the  advantages  of  multireso¬ 
lution  methods  (representational  power,  and  computational  efficiency)  and  of  adaptive  methods 
(statistical  optimality),  thus  improving  over  both  classes  of  representations.  We  illustrated  the 
practical  merits  of  MrICA  (specifically,  its  coding  efficiency)  by  direct  comparison  with  the  cur¬ 
rent  image  coding  standard  for  images  JPEG2000.  The  new  method  demonstrated  that  for  a  large 
range  of  encoding  rates,  the  average  quality  of  the  reconstruction  (both  perceptual,  and  measured 
by  SNR)  is  significantly  better  than  that  of  JPEG2000.  This  strongly  supports  the  idea  of  using 
adaptivity  as  a  source  of  practical  improvement  for  modem  image  coders. 

Point  Coding.  Existing  approaches  for  the  adaptive  sparse  encoding  of  large  signals  are  known 
to  involve  significant  computational  costs.  A  particularly  useful  approach  in  this  respect  is  rep¬ 
resenting  the  signal  via  a  set  of  adaptive  variable-size  shiftable  kernels  (much  smaller  in  size 
than  the  signal  itself).  We  studied  the  particularities  of  applying  such  an  approach  to  images. 
The  most  important  merit  of  our  method  (called  Point  Coding)  is  that  it  produces  a  very  efficient 
adaptive  code,  by  what  can  be  considered  a  direct  approach  towards  an  approximately  shift- 
invariant  representation.  This  is  especially  desirable  in  modeling  natural  or  artificial  encoding 
systems  necessarily  robust  to  signal  shifts,  such  as  the  visual  sensory  system.  A  significant  con¬ 
tribution  of  our  implementation,  is  that  both  the  encoding  and  the  learning  steps  are  performed 
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using  computationally  efficient  ( e.g .,  fast  and  superfast)  algorithms,  thus  allowing  a  practical  way 
to  attain  optimality. 

Robust  Coding.  Finally,  we  provide  a  detailed  mathematical  study  of  Robust  Coding  -  the  prob¬ 
lem  of  optimal  linear  coding  with  limited  precision  units.  We  show  how  to  characterize  optimal 
encoding  solutions  in  the  case  of  Gaussian  channel  noise  and  arbitrarily  many  encoding  units, 
and  derive  efficient  and  stable  algorithms  for  their  computation.  By  conveniently  expressing  the 
limit  of  optimization  as  the  closed-form  bound,  we  formally  explain  the  intuition  that  noisy  en¬ 
coding  units  can  preserve  signal  information  if  sufficiently  many  are  used  -  a  case  very  relevant 
to  modeling  neural  encoding  systems  such  as  the  retina. 


Future  Research  Directions 

The  research  topics  I  have  shortly  presented  here  describe  various  situations  when  existing  sig¬ 
nal  representations  are  improved  by  exploiting  either  the  theoretical  properties,  or  the  statistical 
structure  of  the  signals,  or  both.  These  ideas  can  be  extended  and  refined,  sometimes  with  far- 
reaching  consequences. 

Sparse  ICA.  Multiresolution  ICA  is  limited  in  that  the  representation  depends  highly  on  the 
associated  multiscale  transform;  moreover,  the  correspondence  between  the  subband  ICA  bases 
and  the  full-scale  optimal  basis  is  not  straightforward.  A  direct  approach  to  deriving  efficient 
adaptive  representations  for  large  images  can  exploit  an  observed  by-product  of  ICA  for  images: 
namely,  basis  elements  look  like  localized  oriented  edge  features  which  implies  that  the 
basis  matrix  is  sparse.  By  restricting  the  optimization  to  matrices  satisfying  some  (fixed)  spar¬ 
sity  pattern,  the  search  space  reduces  considerably.  In  addition  to  estimating  fewer  parameters, 
requiring  less  memory,  and  exploiting  fast  sparse-matrix  algorithms,  this  method  should  likely 
be  very  efficient,  comparable  to  unconstrained  ICA  in  this  respect.  The  idea  of  speculating  the 
structure  of  optimal  solutions  suggests  a  general  and  simple  recipe  for  designing  adaptive  repre¬ 
sentations  for  large  signals. 

Algebraic  Signal  Processing  Theory.  Rigorously  classifying  signal  transforms  and  their  algo¬ 
rithms  is  a  primary  goal  of  algebraic  signal  processing  theory.  The  DFT,  the  sixteen  trigonometric 
transforms,  and  many  other  linear  transforms  []98[l  proved  to  be  particular  cases  of  polynomial 
transforms  fit  by  the  general  theory  via  algebraic  matrix  structure.  An  exciting  direction  is  recon¬ 
ciling  the  two  apparently  divergent  views  -  deterministic  and  stochastic  -  on  signal  processing. 
The  first  steps  were  made  by  establishing,  for  instance,  conditions  for  the  equivalence  of  Gauss- 
Markov  random  fields  and  algebraic  signal  models,  which  connects  the  concept  of  Karhunen- 
Loeve  transform  to  the  Fourier  transform.  We  plan  to  push  these  ideas  further,  by  investigating 
what  are  the  correspondents  of  other  adaptive  linear  transforms  -  Robust  Coding,  ICA  -  in  the  al¬ 
gebraic  theory.  The  immediate  outcome  would  be  our  better  understanding  the  structure  of  these 
transforms.  On  the  long  term,  this  could  lead  to  discovering  more  efficient  algorithms,  which 
would  make  adaptive  representations  suitable  for  general-purpose  hardware  implementation. 
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Online  Matching  Pursuit.  A  direction  of  particular  interest  to  me  is  designing  efficient  algo¬ 
rithms  for  nonlinear,  greedy  approximations  of  time-varying  signals  (sound,  video).  Matching 
Pursuit  has  proven  to  be  a  very  practical  choice  for  spike  extraction,  in  the  case  of  shifted-kernel 
dictionaries  [jl07[].  To  apply  it  properly  though,  the  entire  signal  must  be  available  ahead  of  time; 
however,  in  some  situations  (e.g.,  recording  speech,  music,  or  video)  the  encoding  must  be  per¬ 
formed  in  real-time  and  in  and  online  manner.  The  main  bottleneck  in  Matching  Pursuit  is  not  as 
much  updating  the  residual,  as  it  is  selecting  the  next  atom  in  the  representation.  In  the  case  of  a 
small  set  of  kernels,  the  first  issue  is  easily  solvable.  As  for  the  second,  a  solution  is  to  only  fo¬ 
cus  on  a  small,  “sliding  window”  area  of  the  signal  and  thus  process  the  signal  in  a  quasi-online 
fashion.  The  ability  of  such  an  approach  in  producing  a  sparse  set  of  atoms  remains  open  for 
now,  yet  by  inspecting  partial  results  it  seems  comparable  to  that  its  offline  counterpart.  This 
approach  to  online  greedy  approximation  has  a  potentially  high  impact  not  only  within  signal 
processing,  but  also  in  other  research  fields.  For  example,  Spike  Coding  [|l()7j]  has  been  shown 
to  produce  a  representation  that  is  relevant  to  modeling  the  auditory  nerve.  However,  the  spikes 
in  the  representation  are  computed  in  an  offline  fashion,  unlike  the  real  neural  spikes.  We  expect 
that  an  online  greedy  approach  will  fix  this  shortcoming,  and  thus  improve  the  biological  rele¬ 
vance  of  Spike  Coding.  This  promising  idea  could  be  of  great  potential  help  in  manufacturing 
better  prosthetic  hearing  devices. 
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